COVARIANCE MATRIX ESTIMATION FOR STATIONARY TIME SERIES

By Han Xiao and Wei Biao Wu 

University of Chicago 

We obtain a sharp convergence rate for banded covariance matrix estimates of stationary processes. A precise order of magnitude is derived for the spectral radius of sample covariance matrices. We also consider a thresholded covariance matrix estimator that can better characterize sparsity if the true covariance matrix is sparse. As our main tool, we implement Toeplitz's [Math. Ann. 70 (1911) 351-376] idea and relate eigenvalues of covariance matrices to the spectral densities or Fourier transforms of the covariances. We develop a large deviation result for quadratic forms of stationary processes using $m$-dependence approximation, under the framework of causal representation and physical dependence measures.

1. Introduction. One hundred years ago, in 1911, Toeplitz obtained a deep result on eigenvalues of infinite matrices of the form $\Sigma_\infty = (a_{s-t})_{s,t=-\infty}^{\infty}$. We say that $\lambda$ is an eigenvalue of $\Sigma_\infty$ if the matrix $\Sigma_\infty - \lambda\,\mathrm{Id}_\infty$ does not have a bounded inverse, where $\mathrm{Id}_\infty$ denotes the infinite-dimensional identity matrix. Toeplitz proved that, interestingly, the set of eigenvalues is the same as the image set $\{g(\theta), \theta \in [0, 2\pi]\}$, where

(1) $g(\theta) = \sum_{t \in \mathbb{Z}} a_t e^{\mathrm{i} t \theta}$ with $\mathrm{i} = \sqrt{-1}$.

Note that $g(\theta)$ is the Fourier transform of the sequence $(a_t)_{t=-\infty}^{\infty}$. For a finite $T \times T$ matrix $\Sigma_T = (a_{s-t})_{1 \le s,t \le T}$, its eigenvalues are approximately equally distributed (in the sense of Hermann Weyl) as $\{g(\omega_s), s = 0, \ldots, T-1\}$, where $\omega_s = 2\pi s/T$ are the Fourier frequencies. See the excellent monograph by Grenander and Szego (1958) for a detailed account.
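The relation between Toeplitz eigenvalues and the Fourier transform $g$ can be checked numerically. The following is a minimal sketch under illustrative assumptions that are not part of the original text: the sequence $a_k = \rho^{|k|}$ with $\rho = 0.6$, whose transform has the standard closed form $g(\theta) = (1-\rho^2)/(1 - 2\rho\cos\theta + \rho^2)$, and the helper name `toeplitz_from_sequence`.

```python
import numpy as np

def toeplitz_from_sequence(a, T):
    """T x T matrix with (s, t) entry a(|s - t|) for a symmetric sequence a."""
    idx = np.abs(np.subtract.outer(np.arange(T), np.arange(T)))
    return a(idx)

rho = 0.6                                   # hypothetical choice: a_k = rho^|k|
a = lambda k: rho ** np.abs(k)
# Closed-form Fourier transform of a_k = rho^|k| (geometric series).
g = lambda theta: (1 - rho**2) / (1 - 2 * rho * np.cos(theta) + rho**2)

T = 200
eig = np.sort(np.linalg.eigvalsh(toeplitz_from_sequence(a, T)))
symbol = np.sort(g(2 * np.pi * np.arange(T) / T))   # g at the Fourier frequencies

print("eigenvalue range:", eig[0], eig[-1])
print("range of g      :", g(np.pi), g(0.0))
print("mean |sorted eigenvalue - sorted g value|:", np.mean(np.abs(eig - symbol)))
```

The eigenvalues stay inside the range of $g$, and their sorted values track the sorted values of $g$ at the Fourier frequencies; how close the two are for a given $T$ depends on the sequence chosen.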
The covariance matrix is of fundamental importance in many aspects of statistics including multivariate analysis, principal component analysis, linear discriminant analysis and graphical modeling. One can infer dependence structures among variables by estimating the associated covariance matrices. In the context of stationary time series analysis, due to stationarity, the covariance matrix is Toeplitz in that, along the off-diagonals that are parallel to the main diagonal, the values are constant. Let $(X_t)_{t \in \mathbb{Z}}$ be a stationary process with mean $\mu = \mathbb{E}X_t$, and denote by $\gamma_k = \mathbb{E}[(X_0 - \mu)(X_k - \mu)]$, $k \in \mathbb{Z}$, its autocovariances. Then

(2) $\Sigma_T = (\gamma_{s-t})_{1 \le s,t \le T}$

is the autocovariance matrix of $(X_1, \ldots, X_T)$. In the rest of the paper for simplicity we also call (2) the covariance matrix of $(X_1, \ldots, X_T)$. In time series analysis it plays a crucial role in prediction [Kolmogoroff (1941), Wiener (1949)], smoothing and best linear unbiased estimation (BLUE). For example, in the Wiener-Kolmogorov prediction theory, one predicts $X_{T+1}$ based on past observations $X_T, X_{T-1}, \ldots$. If the covariances $\gamma_k$ were known, given observations $X_1, \ldots, X_T$, the coefficients of the best linear unbiased predictor $\hat X_{T+1} = \sum_{s=1}^{T} a_{T,s} X_{T+1-s}$ in terms of the mean square error $\|X_{T+1} - \hat X_{T+1}\|^2$ are the solution of the discrete Wiener-Hopf equation

$\Sigma_T a_T = \gamma_T$,

where $a_T = (a_{T,1}, a_{T,2}, \ldots, a_{T,T})^\top$ and $\gamma_T = (\gamma_1, \gamma_2, \ldots, \gamma_T)^\top$, and we use the superscript $\top$ to denote the transpose of a vector or a matrix. If $\gamma_k$ are not known, we need to estimate them from the sample $X_1, \ldots, X_T$, and a good estimate of $\Sigma_T$ is required. As another example, suppose now $\mu = \mathbb{E}X_t \ne 0$ and we want to estimate it from $X_1, \ldots, X_T$ by the form $\hat\mu = \sum_{t=1}^{T} c_t X_t$, where $c_t$ satisfy the constraint $\sum_{t=1}^{T} c_t = 1$. To obtain the BLUE, one minimizes $\mathbb{E}(\hat\mu - \mu)^2$ subject to $\sum_{t=1}^{T} c_t = 1$, ensuring unbiasedness. Note that the usual choice $c_t \equiv 1/T$ may not lead to BLUE. The optimal coefficients are given by $(c_1, \ldots, c_T)^\top = (\mathbf{1}^\top \Sigma_T^{-1} \mathbf{1})^{-1} \Sigma_T^{-1} \mathbf{1}$, where $\mathbf{1} = (1, \ldots, 1)^\top \in \mathbb{R}^T$; see Adenstedt (1974). Again a good estimate of $\Sigma_T^{-1}$ is needed.
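Both uses of the covariance matrix described above reduce to solving a linear system. The following sketch assumes, purely for illustration, an AR(1) autocovariance $\gamma_k = \rho^{k}/(1-\rho^2)$ with $\rho = 0.5$; the function name `autocov_matrix` is hypothetical. For this choice the Wiener-Hopf solution is known to be $(\rho, 0, \ldots, 0)^\top$, which gives a simple check.

```python
import numpy as np

def autocov_matrix(gamma, T):
    """Sigma_T = (gamma_{s-t}) built from gamma[0..T-1]."""
    idx = np.abs(np.subtract.outer(np.arange(T), np.arange(T)))
    return gamma[idx]

rho, T = 0.5, 8
gamma = rho ** np.arange(T + 1) / (1 - rho**2)   # assumed AR(1) autocovariances
Sigma = autocov_matrix(gamma[:T], T)

# Discrete Wiener-Hopf equation: Sigma_T a_T = (gamma_1, ..., gamma_T)^T.
a = np.linalg.solve(Sigma, gamma[1:T + 1])

# BLUE weights for the mean: c = Sigma^{-1} 1 / (1^T Sigma^{-1} 1).
ones = np.ones(T)
w = np.linalg.solve(Sigma, ones)
c = w / (ones @ w)

print("prediction coefficients:", np.round(a, 6))
print("BLUE weights:", np.round(c, 4), "sum =", c.sum())
print("variance of BLUE vs sample mean:", c @ Sigma @ c, ones @ Sigma @ ones / T**2)
```

In practice $\gamma_k$ is unknown, so `Sigma` would be replaced by an estimate, which is where the quality of the covariance matrix estimate enters.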

Given observations $X_1, X_2, \ldots, X_T$, assuming at the outset that $\mathbb{E}X_t = 0$, we can naturally estimate $\Sigma_T$ via plug-in by the sample version

(3) $\hat\Sigma_T = (\hat\gamma_{s-t})_{1 \le s,t \le T}$ where $\hat\gamma_k = \frac{1}{T} \sum_{t=|k|+1}^{T} X_{t-|k|} X_t$.

To judge the quality of a matrix estimate, we use the operator norm. The term "operator norm" usually indicates a class of matrix norms; in this paper it refers to the $\ell_2/\ell_2$ operator norm or spectral radius defined as $\lambda(A) := \max_{|x|=1} |Ax|$ for any symmetric real matrix $A$, where $x$ is a real vector, and $|x|$ denotes its Euclidean norm. For the estimate $\hat\Sigma_T$ in (3), unfortunately, because too many parameters or autocovariances are estimated and the signal-to-noise ratios are too small at large lags, this estimate is not consistent. Wu and Pourahmadi (2009) showed that $\lambda(\hat\Sigma_T - \Sigma_T) \not\to 0$ in probability. In Section 2 we provide a precise order of magnitude of $\lambda(\hat\Sigma_T - \Sigma_T)$ and give explicit upper and lower bounds.
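The plug-in estimate (3) and its operator-norm error are easy to compute. The sketch below is illustrative only: it assumes Gaussian white noise (so that $\Sigma_T$ is the identity), and the helper names `sample_autocov`, `toeplitz` and `op_norm` are hypothetical.

```python
import numpy as np

def sample_autocov(x):
    """gamma_hat_k = T^{-1} sum_{t=k+1}^T x_{t-k} x_t, k = 0..T-1 (mean assumed zero)."""
    T = len(x)
    return np.array([x[:T - k] @ x[k:] for k in range(T)]) / T

def toeplitz(g):
    """Symmetric Toeplitz matrix with (s, t) entry g[|s - t|]."""
    T = len(g)
    return g[np.abs(np.subtract.outer(np.arange(T), np.arange(T)))]

def op_norm(A):
    """Spectral radius of a symmetric real matrix."""
    return np.max(np.abs(np.linalg.eigvalsh(A)))

rng = np.random.default_rng(0)
for T in (100, 400, 800):
    x = rng.standard_normal(T)            # white noise: Sigma_T is the identity
    err = op_norm(toeplitz(sample_autocov(x)) - np.eye(T))
    print(T, round(err, 3), "error / log T =", round(err / np.log(T), 3))
```

The printed error does not shrink as $T$ grows, in line with the inconsistency discussed above; a single run is of course only suggestive.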

The inconsistency of sample covariance matrices has also been observed in 
the context of high-dimensional multivariate analysis, and this phenomenon 
is now well understood, thanks to the results from random matrix theory. 
See, among others, Marcenko and Pastur (1967), Bai and Yin (1993) and 
Johnstone (2001). Recently, there has been a surge of interest in regularized covariance matrix estimation in high-dimensional statistical inference. We only sample a few works which are closely related to our problem. Cai, Zhang and
Zhou (2010), Bickel and Levina (2008b) and Furrer and Bengtsson (2007) 
studied the banding and/or tapering methods, while Bickel and Levina 
(2008a) and El Karoui (2008) considered the regularization by thresholding. 
In most of these works, convergence rates of the estimates were established. 

However, none of the techniques used in the aforementioned papers is ap- 
plicable in our setting since their estimates require multiple independent and 
identically distributed (i.i.d.) copies of random vectors from the underlying 
multivariate distribution, though the number of copies can be far less than 
the dimension of the vector. In time series analysis, however, it is typical 
that only one realization is available. Hence we shall naturally use the sample autocovariances. In a companion paper, Xiao and Wu (2011) established a systematic theory for $\mathcal{L}^2$ and $\mathcal{L}^\infty$ deviations of sample autocovariances.
Based on that, we adopt the regularization idea and study properties of the 
banded, tapered and thresholded estimates of the covariance matrices. Wu 
and Pourahmadi (2009) and McMurry and Politis (2010) applied the band- 
ing and tapering methods to the same problem, but here we shall obtain 
a better and optimal convergence rate. We shall point out that the regular- 
ization ideas of banding and tapering are not novel in time series analysis 
and they have been applied in nonparametric spectral density estimation. 

In this paper we use the ideas in Toeplitz (1911) and Grenander and 
Szego (1958) together with Wu's (2005) recent theory on stationary pro- 
cesses to present a systematic theory for estimates of covariance matrices of 
stationary processes. In particular, we shall exploit the connection between 
covariance matrices and spectral density functions and prove a sharp conver- 
gence rate for banded covariance matrix estimates of stationary processes. 
Using convergence properties of periodograms, we derive a precise order of 
magnitude for spectral radius of sample covariance matrices. We also con- 
sider a thresholded covariance matrix estimator that can better characterize 
sparsity if the true covariance matrix is sparse. As a main technical tool, we 
develop a large deviation type result for quadratic forms of stationary processes using $m$-dependence approximation, under the framework of causal representations and physical dependence measures.
The rest of this article is organized as follows. In Section 2 we introduce
the framework of causal representation and physical dependence measures 
that are useful for studying convergence properties of covariance matrix 
estimates. We provide in Section 2 upper and lower bounds for the op- 
erator norm of the sample covariance matrices. The convergence rates of 
banded/tapered and thresholded sample covariance matrices are established
in Sections 3 and 4, respectively. We also conduct a simulation study to com- 
pare the finite sample performances of banded and thresholded estimates in 
Section 5. Some useful moment inequalities are collected in Section 6. A large 
deviation result about quadratic forms of stationary processes, which is of 
independent interest, is given in Section 7. Section 8 concludes the paper. 

We now introduce some notation. For a random variable $X$ and $p > 0$, we write $X \in \mathcal{L}^p$ if $\|X\|_p := (\mathbb{E}|X|^p)^{1/p} < \infty$, and use $\|X\|$ as a shorthand for $\|X\|_2$. To express centering of random variables concisely, we define the operator $\mathbb{E}_0$ as $\mathbb{E}_0(X) := X - \mathbb{E}X$. Hence $\mathbb{E}_0(\mathbb{E}_0(X)) = \mathbb{E}_0(X)$. For a symmetric real matrix $A$, we use $\lambda_{\min}(A)$ and $\lambda_{\max}(A)$ for its smallest and largest eigenvalues, respectively, and use $\lambda(A)$ to denote its operator norm or spectral radius. For a real number $x$, $\lfloor x \rfloor := \max\{y \in \mathbb{Z}: y \le x\}$ denotes its integer part and $\lceil x \rceil := \min\{y \in \mathbb{Z}: y \ge x\}$. For two real numbers $x, y$, set $x \vee y := \max\{x, y\}$ and $x \wedge y := \min\{x, y\}$. For two sequences of positive numbers $(a_T)$ and $(b_T)$, we write $a_T \asymp b_T$ if there exists some constant $C > 1$ such that $C^{-1} \le a_T/b_T \le C$ for all $T$. The letter $C$ denotes a constant, whose values may vary from place to place. We sometimes add symbolic subscripts to emphasize that the value of $C$ depends on the subscripts.

2. Exact order of operator norms of sample covariance matrices. Suppose $Y$ is a $p \times n$ random matrix consisting of i.i.d. entries with mean 0 and variance 1, which could be viewed as a sample of size $n$ from some $p$-dimensional population; then $YY^\top/n$ is the sample covariance matrix. If $\lim_{n \to \infty} p/n = c > 0$, then $YY^\top/n$ is not a consistent estimate of the population covariance matrix (which is the identity matrix) in terms of the operator norm. This is a well-known phenomenon in the random matrix literature; see, for example, Marcenko and Pastur (1967), Section 5.2 in Bai and Silverstein (2010), Johnstone (2001) and El Karoui (2005). However, the techniques used in those papers are not applicable here, because we have only one realization and the dependence within the vector can be quite complicated. Thanks to the Toeplitz structure of $\Sigma_T$, our method depends on the connection between its eigenvalues and the spectral density, defined by

(4) $f(\theta) = \frac{1}{2\pi} \sum_{k \in \mathbb{Z}} \gamma_k \cos(k\theta)$.



The following lemma is a special case of Section 5.2 [Grenander and Szego 
(1958)]. 
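Lemma 1. Let $h$ be a continuous symmetric function on $[-\pi, \pi]$. Denote by $\underline{h}$ and $\overline{h}$ its minimum and maximum, respectively. Define $a_k = \int_{-\pi}^{\pi} h(\theta) e^{-\mathrm{i} k \theta}\, d\theta$ and the $T \times T$ matrix $\Gamma_T = (a_{s-t})_{1 \le s,t \le T}$; then

$2\pi \underline{h} \le \lambda_{\min}(\Gamma_T) \le \lambda_{\max}(\Gamma_T) \le 2\pi \overline{h}$.

Lemma 1 can be easily proved by noting that

(5) $x^\top \Gamma_T x = \int_{-\pi}^{\pi} |x^\top \rho(\theta)|^2 h(\theta)\, d\theta$ where $\rho(\theta) = (e^{\mathrm{i}\theta}, e^{\mathrm{i}2\theta}, \ldots, e^{\mathrm{i}T\theta})^\top$.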

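The sample covariance matrix (3) is closely related to the periodogram

$I_T(\theta) = T^{-1} \left| \sum_{t=1}^{T} X_t e^{\mathrm{i} t \theta} \right|^2 = \sum_{k=1-T}^{T-1} \hat\gamma_k e^{\mathrm{i} k \theta}$.

By Lemma 1, we have $\lambda(\hat\Sigma_T) \le \max_{-\pi \le \theta \le \pi} I_T(\theta)$. Asymptotic properties of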
periodograms have recently been studied by Peligrad and Wu (2010) and 
Lin and Liu (2009). To introduce the result in the latter paper, we assume 
that the process (Xt) has the causal representation 

(6) $X_t = g(\varepsilon_t, \varepsilon_{t-1}, \ldots)$,

where $g$ is a measurable function such that $X_t$ is well defined, and $\varepsilon_t$, $t \in \mathbb{Z}$, are i.i.d. random variables. The framework (6) is very general [see, e.g., Tong (1990)] and easy to work with. Let $\mathcal{F}_t = (\varepsilon_t, \varepsilon_{t-1}, \ldots)$ be the set of innovations up to time $t$; we write $X_t = g(\mathcal{F}_t)$. Let $\varepsilon'_t$, $t \in \mathbb{Z}$, be an i.i.d. copy of $\varepsilon_t$, $t \in \mathbb{Z}$. Define $\mathcal{F}_t^* = (\varepsilon_t, \ldots, \varepsilon_1, \varepsilon'_0, \varepsilon_{-1}, \ldots)$, obtained by replacing $\varepsilon_0$ in $\mathcal{F}_t$ by $\varepsilon'_0$, and set $X'_t = g(\mathcal{F}_t^*)$. Following Wu (2005), for $p > 0$, define

(7) $\Theta_p(m) = \sum_{t=m}^{\infty} \delta_p(t)$, $m \ge 0$, where $\delta_p(t) = \|X_t - X'_t\|_p$.

In Wu (2005), the quantity $\delta_p(t)$ is called the physical dependence measure. We make the convention that $\delta_p(t) = 0$ for $t < 0$. Throughout the article, we assume the short-range dependence condition $\Theta_p := \Theta_p(0) < \infty$. Under a mild condition on the tail sum $\Theta_p(m)$ (cf. Theorem 2), Lin and Liu (2009) obtained the weak convergence result

(8) $\max_{1 \le s \le q} \left\{ \frac{I_T(2\pi s/T)}{2\pi f(2\pi s/T)} \right\} - \log q \Rightarrow \mathcal{G}$,

where $\Rightarrow$ denotes the convergence in distribution, $\mathcal{G}$ is the Gumbel distribution with the distribution function $e^{-e^{-x}}$, and $q = \lceil T/2 \rceil - 1$. Using this result, we can provide explicit upper and lower bounds on the operator norm of the sample covariance matrix.
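For a linear process $X_t = \sum_{j \ge 0} a_j \varepsilon_{t-j}$ the coupling in (7) is explicit: replacing $\varepsilon_0$ by $\varepsilon'_0$ changes $X_t$ by exactly $a_t(\varepsilon_0 - \varepsilon'_0)$, so $\delta_p(t) = |a_t| \, \|\varepsilon_0 - \varepsilon'_0\|_p$. The sketch below illustrates this under assumptions that are not from the original text: coefficients $a_j = 0.7^j$, a truncation of the infinite sum at lag `J`, and standard normal innovations (for which $\|\varepsilon_0 - \varepsilon'_0\|_2 = \sqrt{2}$).

```python
import numpy as np

rng = np.random.default_rng(1)
J = 50                                    # truncation lag of the MA(infinity) sum (assumption)
a = 0.7 ** np.arange(J + 1)               # hypothetical coefficients a_j = 0.7^j

def X(t, eps, origin):
    """X_t = sum_{j=0}^J a_j eps_{t-j}; eps[origin + s] stores eps_s."""
    return sum(a[j] * eps[origin + t - j] for j in range(J + 1))

origin = J
eps = rng.standard_normal(2 * J + 1)      # eps_{-J}, ..., eps_J
eps_star = eps.copy()
eps_star[origin] = rng.standard_normal()  # replace eps_0 by an independent copy

t = 5
diff = X(t, eps, origin) - X(t, eps_star, origin)
expected = a[t] * (eps[origin] - eps_star[origin])
print("X_t - X'_t =", diff, " a_t (eps_0 - eps'_0) =", expected)

def theta_2(m):
    """Tail sum Theta_2(m) = sqrt(2) * sum_{t >= m} |a_t|, truncated at lag J."""
    return np.sqrt(2.0) * np.sum(np.abs(a[m:]))

print("Theta_2(0) =", theta_2(0), " Theta_2(10) =", theta_2(10))
```

For nonlinear $g$ no such closed form is available in general, and $\delta_p(t)$ would have to be bounded analytically or approximated by simulating the coupled pair.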
Theorem 2. Assume $X_t \in \mathcal{L}^p$ for some $p > 2$ and $\mathbb{E}X_t = 0$. If $\Theta_p(m) = o(1/\log m)$ and $\min_\theta f(\theta) > 0$, then

$\lim_{T \to \infty} P\left[ \frac{\pi [\min_\theta f(\theta)]^2}{12 \Theta_2^2} \log T \le \lambda(\hat\Sigma_T) \le 10 \Theta_2^2 \log T \right] = 1$.

According to Lemma 1, we know $\lambda_{\max}(\Sigma_T) \le 2\pi \max_\theta f(\theta)$. As an immediate consequence of Theorem 2, there exists a constant $C > 1$ such that

$\lim_{T \to \infty} P[C^{-1} \log T \le \lambda(\hat\Sigma_T - \Sigma_T) \le C \log T] = 1$,

which means the estimate $\hat\Sigma_T$ is not consistent, and the order of magnitude of $\lambda(\hat\Sigma_T - \Sigma_T)$ is $\log T$. Earlier, Wu and Pourahmadi (2009) also showed that the plug-in estimate $\hat\Sigma_T = (\hat\gamma_{s-t})_{1 \le s,t \le T}$ is not consistent, namely, $\lambda(\hat\Sigma_T - \Sigma_T) \not\to 0$ in probability. Our Theorem 2 substantially refines this result by providing an exact order of magnitude of $\lambda(\hat\Sigma_T - \Sigma_T)$.
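The two facts behind the upper bound, namely the identity $I_T(\theta) = \sum_{|k|<T} \hat\gamma_k e^{\mathrm{i}k\theta}$ and the inequality $\lambda(\hat\Sigma_T) \le \max_\theta I_T(\theta)$ from Lemma 1, can be checked on simulated data. This is a minimal sketch assuming Gaussian white noise and a dense frequency grid obtained from a zero-padded FFT; the grid size `N` is an arbitrary choice, so the grid maximum is only a close approximation from below to the true maximum.

```python
import numpy as np

rng = np.random.default_rng(2)
T = 50
x = rng.standard_normal(T)

gamma_hat = np.array([x[:T - k] @ x[k:] for k in range(T)]) / T
Sigma_hat = gamma_hat[np.abs(np.subtract.outer(np.arange(T), np.arange(T)))]
lam = np.max(np.abs(np.linalg.eigvalsh(Sigma_hat)))

# Periodogram on a dense grid: |sum_t x_t e^{i t theta}|^2 / T at theta = 2 pi j / N.
N = 2 ** 16
I_grid = np.abs(np.fft.fft(x, n=N)) ** 2 / T

# The two expressions for the periodogram, compared at a single frequency.
theta = 0.3
k = np.arange(1, T)
I_direct = np.abs(np.sum(x * np.exp(1j * theta * np.arange(1, T + 1)))) ** 2 / T
I_from_gamma = gamma_hat[0] + 2 * np.sum(gamma_hat[1:] * np.cos(k * theta))

print("lambda(Sigma_hat) =", lam, " max periodogram on grid =", I_grid.max())
print("I_T(0.3) two ways:", I_direct, I_from_gamma)
```

Repeating this for increasing $T$ and dividing by $\log T$ gives a rough picture of the $\log T$ order in Theorem 2, although the constants in the theorem are not meant to be tight.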

An, Chen and Hannan (1983) showed that under suitable conditions, for linear processes with i.i.d. innovations,

(9) $\lim_{T \to \infty} \max_\theta \{ I_T(\theta) / [2\pi f(\theta) \log T] \} = 1$ almost surely.

A stronger version was found by Turkman and Walker (1990) for Gaussian processes. Based on (9), we conjecture that

$\lim_{T \to \infty} \frac{\lambda(\hat\Sigma_T)}{2\pi \max_\theta f(\theta) \log T} = 1$ almost surely.

Turkman and Walker (1984) established the following result on the maximum periodogram of a sequence of i.i.d. standard normal random variables:

(10) $\max_\theta I_T(\theta) - \log T - \frac{\log(\log T)}{2} + \frac{\log(3/\pi)}{2} \Rightarrow \mathcal{G}$.

In view of (8) and (10), we conjecture that $\lambda(\hat\Sigma_T)$ also converges to the Gumbel distribution after proper centering and rescaling. Note that the Gumbel convergence (10), where the maximum is taken over the entire interval $\theta \in [-\pi, \pi]$, has a different centering term from the one in (8) which is obtained over Fourier frequencies.

If $Y$ is a $p \times n$ random matrix consisting of i.i.d. entries, Geman (1980) and Yin, Bai and Krishnaiah (1988) proved a strong convergence result for the largest eigenvalues of $Y^\top Y$, in the paradigm where $n, p \to \infty$ such that $p/n \to c \in (0, \infty)$. See also Bai and Silverstein (2010) and references therein. If in addition the entries of $Y$ are i.i.d. complex normal or normal random variables, Johansson (2000) and Johnstone (2001) presented an asymptotic distributional theory and showed that, after proper normalization, the limiting distribution of the largest eigenvalue follows the Tracy-Widom law [Tracy and Widom (1994)]. Again, their methods depend heavily on the setup that there are i.i.d. copies of a random vector with independent entries, and/or the normality assumption, so they are not applicable here. Bryc, Dembo and Jiang (2006) studied the limiting spectral distribution (LSD) of random Toeplitz matrices whose entries on different sub-diagonals are i.i.d. Solo (2010) considered the LSD of sample covariance matrices generated by a sample which is temporally dependent.

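Proof of Theorem 2. For notational simplicity we let $\underline{f} := \min_\theta f(\theta)$ and $\overline{f} := \max_\theta f(\theta)$. It follows immediately from (8) that for any $\delta > 0$,

(11) $\lim_{T \to \infty} P\left[ \max_\theta I_T(\theta) \ge 2(1 - \delta) \pi \underline{f} \log T \right] = 1$.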

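The result in (8) is not sufficient to yield an upper bound of $\max_\theta I_T(\theta)$. For this purpose we need to consider the maxima over a finer grid and then use Lemma 3 to extend to the maxima over the whole real line. Set $j_T = \lfloor T \log T \rfloor$ and $\theta_s := \theta_{T,s} = \pi s / j_T$ for $0 \le s \le j_T$. Define $m_T = \lfloor T^\beta \rfloor$ for some $0 < \beta < 1$, and $\tilde X_t = \mathcal{H}_{t-m_T} X_t = \mathbb{E}(X_t \mid \varepsilon_{t-m_T}, \varepsilon_{t-m_T+1}, \ldots)$. Let $S_T(\theta) = \sum_{t=1}^{T} X_t e^{\mathrm{i} t \theta}$ be the Fourier transform of $(X_t)_{1 \le t \le T}$, and $\tilde S_T(\theta) = \sum_{t=1}^{T} \tilde X_t e^{\mathrm{i} t \theta}$ for the $m_T$-dependent sequence $(\tilde X_t)_{1 \le t \le T}$. By Lemma 3.4 of Lin and Liu (2009), we have

(12) $\max_{0 \le s \le j_T} T^{-1/2} |S_T(\theta_s) - \tilde S_T(\theta_s)| = o_P((\log T)^{-1/2})$.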

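Now partition the interval $[1, T]$ into blocks $\mathcal{B}_1, \mathcal{B}_2, \ldots, \mathcal{B}_{w_T}$ of size $m_T$, where $w_T = \lfloor T/m_T \rfloor$, and the last block may have size $m_T \le |\mathcal{B}_{w_T}| < 2m_T$. Define the block sum $U_{T,k}(\theta) = \sum_{t \in \mathcal{B}_k} \tilde X_t e^{\mathrm{i} t \theta}$ for $1 \le k \le w_T$. Choose $\beta > 0$ small enough such that for some $0 < \gamma < 1/2$, the inequality

(13) $1 - \beta + \beta p - \gamma(p - 1) - 1/2 < 0$

holds. We use truncation and define $\bar U_{T,k}(\theta) = \mathbb{E}_0[U_{T,k}(\theta) \mathbf{1}_{|U_{T,k}(\theta)| \le T^\gamma}]$. Using similar arguments as equation (5.5) [Lin and Liu (2009)] and (13), we have

(14) $\max_{0 \le s \le j_T} T^{-1/2} \left| \sum_{k=1}^{w_T} [U_{T,k}(\theta_s) - \bar U_{T,k}(\theta_s)] \right| = o_P((\log T)^{-1/2})$.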

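Observe that $\bar U_{T,k_1}(\theta)$ and $\bar U_{T,k_2}(\theta)$ are independent if $|k_1 - k_2| > 1$. Let $\Re(z)$ denote the real part of a complex number $z$. Split the sum $\sum_{k=1}^{w_T} \bar U_{T,k}(\theta)$ into four parts

$R_{T,1}(\theta) = \sum_{k \text{ odd}} \Re(\bar U_{T,k}(\theta))$, $R_{T,2}(\theta) = \sum_{k \text{ even}} \Re(\bar U_{T,k}(\theta))$

and $R_{T,3}$, $R_{T,4}$ similarly for the imaginary part of $\bar U_{T,k}$. Since $\mathbb{E}|\bar U_{T,k}(\theta)|^2 \le \mathbb{E}|U_{T,k}(\theta)|^2 \le |\mathcal{B}_k| \Theta_2^2$, by Bernstein's inequality [cf. Freedman (1975)],

$\sup_\theta P\left[ |R_{T,l}(\theta)| \ge \frac{3 \Theta_2 \sqrt{T \log T}}{\sqrt{8}} \right] \le 2 \exp\left\{ - \frac{(9/8) \log T}{1 + 8 \Theta_2^{-1} \sqrt{2 \log T}\, T^{\gamma - 1/2}} \right\}$

for $1 \le l \le 4$. It follows that

(15) $\lim_{T \to \infty} P\left[ \max_{0 \le s \le j_T} \left| \sum_{k=1}^{w_T} \bar U_{T,k}(\theta_s) \right| \ge 3 \Theta_2 \sqrt{T \log T} \right] = 0$.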

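Combining (12), (14) and (15), we have

(16) $\lim_{T \to \infty} P\left[ \max_{0 \le s \le j_T} I_T(\theta_s) \le 9.5 \Theta_2^2 \log T \right] = 1$,

which together with Lemma 3 implies that

(17) $\lim_{T \to \infty} P\left[ \max_\theta I_T(\theta) \le 10 \Theta_2^2 \log T \right] = 1$.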

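The upper bound in Theorem 2 is an immediate consequence in view of Lemma 1. For the lower bound, we use the inequality

$\lambda(\hat\Sigma_T) \ge \max_\theta \{ T^{-1} \rho(\theta)^* \hat\Sigma_T \rho(\theta) \}$,

where $\rho(\theta)$ is defined in (5), and $\rho(\theta)^*$ is its Hermitian transpose. Note that

$\rho(\theta)^* \hat\Sigma_T \rho(\theta) = \sum_{s,t=1}^{T} \hat\gamma_{s-t} e^{-\mathrm{i} s \theta} e^{\mathrm{i} t \theta} = \frac{1}{2\pi} \sum_{s,t=1}^{T} \int_{-\pi}^{\pi} I_T(\omega) e^{-\mathrm{i}(s-t)\theta} e^{\mathrm{i}(s-t)\omega}\, d\omega = \frac{1}{2\pi} \int_{-\pi}^{\pi} I_T(\omega + \theta) \left| \sum_{t=1}^{T} e^{\mathrm{i} t \omega} \right|^2 d\omega$.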

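By Bernstein's inequality on the derivative of trigonometric polynomials [cf. Zygmund (2002), Theorem 3.13, Chapter X], we have

$\max_\theta |I_T'(\theta)| \le T \cdot \max_\theta I_T(\theta)$.

Let $\theta_0 = \arg\max_\theta I_T(\theta)$. Set $c = (1 - \delta) \pi \underline{f} / (10 \Theta_2^2)$. By Lemma 1 and (40), we know $2\pi \overline{f} \le \Theta_2^2$, and hence $c \le 1/20$. If $I_T(\theta_0) \ge 2(1 - \delta) \pi \underline{f} \log T$ and $\max_\theta I_T(\theta) \le 10 \Theta_2^2 \log T$, then for $|\omega| \le c/T$, we have

$I_T(\theta_0 + \omega) \ge [2(1 - \delta) \pi \underline{f} - 10 c \Theta_2^2] \log T = (1 - \delta) \pi \underline{f} \log T$.

Since $\left| \sum_{t=1}^{T} e^{\mathrm{i} t \omega} \right|^2 \ge 10 T^2 / 11$ when $|\omega| \le c/T$, it follows that

$\rho(\theta_0)^* \hat\Sigma_T \rho(\theta_0) \ge \frac{1}{2\pi} \cdot (1 - \delta) \pi \underline{f} \log T \cdot \frac{10 T^2}{11} \cdot \frac{2c}{T} = \frac{\pi (1 - \delta)^2 \underline{f}^2 T \log T}{11 \Theta_2^2}$,

which implies that $\lambda(\hat\Sigma_T) \ge \pi (1 - \delta)^2 \underline{f}^2 \log T / (11 \Theta_2^2)$. The proof is completed by selecting $\delta$ small enough. □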

Remark 1. In the proof, as well as many other places in this article, we often need to partition an integer interval $[s, t] \subset \mathbb{N}$ into consecutive blocks $\mathcal{B}_1, \ldots, \mathcal{B}_b$ with the same size $m$. Since $t - s + 1$ may not be a multiple of $m$, we make the convention that the last block has the size $m \le |\mathcal{B}_b| < 2m$, and all the other ones have the same size $m$.
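The blocking convention of Remark 1 can be written out as a small routine. The function name `partition_blocks` is hypothetical, and the sketch assumes the interval contains at least $m$ integers.

```python
def partition_blocks(s, t, m):
    """Split the integers s..t into consecutive blocks of size m; the last block
    absorbs the remainder, so its size lies in [m, 2m). Requires t - s + 1 >= m."""
    n = t - s + 1
    if n < m:
        raise ValueError("interval shorter than one block")
    b = n // m                                   # number of blocks
    starts = [s + i * m for i in range(b)]
    ends = [st + m - 1 for st in starts]
    ends[-1] = t                                 # last block runs to the end
    return [list(range(st, en + 1)) for st, en in zip(starts, ends)]

blocks = partition_blocks(1, 10, 3)
print([len(blk) for blk in blocks], blocks[-1])
```

With $s = 1$, $t = 10$ and $m = 3$ this produces three blocks, the last one being $\{7, 8, 9, 10\}$.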



3. Banded covariance matrix estimates. In view of Lemma 1, the inconsistency of $\hat\Sigma_T$ is due to the fact that the periodogram $I_T(\theta)$ is not a consistent estimate of the spectral density $f(\theta)$. To estimate the spectral density consistently, it is very common to use smoothing. In particular, consider the lag window estimate

(18) $\hat f_{T,B_T}(\theta) = \frac{1}{2\pi} \sum_{k=-B_T}^{B_T} K(k/B_T) \hat\gamma_k \cos(k\theta)$,

where $B_T$ is the bandwidth satisfying natural conditions $B_T \to \infty$ and $B_T/T \to 0$, and $K(\cdot)$ is a symmetric kernel function satisfying

$K(0) = 1$, $|K(x)| \le 1$ and $K(x) = 0$ for $|x| > 1$.

Correspondingly, we define the tapered covariance matrix estimate

$\hat\Sigma_{T,B_T} = [K((s - t)/B_T) \hat\gamma_{s-t}]_{1 \le s,t \le T} = \hat\Sigma_T * W_T$,

where * is the Hadamard (or Schur) product, which is formed by element- 
wise multiplication of matrices. The term "tapered" is consistent with the 
terminology of the high-dimensional covariance regularization literature. 
However, the reader should not confuse this with the notion "data taper" 
that is commonly used in time series analysis. Our tapered estimate parallels a lag-window estimate of the spectral density with a tapered window. As a special case, if $K(x) = \mathbf{1}_{\{|x| \le 1\}}$ is the rectangular kernel, then $\hat\Sigma_{T,B_T} = (\hat\gamma_{s-t} \mathbf{1}_{\{|s-t| \le B_T\}})_{1 \le s,t \le T}$ is the banded sample covariance matrix. However, this estimate may not be nonnegative definite. A nonnegative definite estimate can be obtained through the Schur product theorem in matrix theory [Horn and Johnson (1990)]: since $\hat\Sigma_T$ is nonnegative definite, the Schur product $\hat\Sigma_{T,B_T}$ is also nonnegative definite if $W_T = [K((s - t)/B_T)]_{1 \le s,t \le T}$ is nonnegative definite. The Bartlett or triangular window $K_B(u) = \max(0, 1 - |u|)$ leads to a positive definite weight matrix $W_T$ in view of

(19) $K_B(u) = \int_{\mathbb{R}} w(x) w(x + u)\, dx$,

where $w(x) = \mathbf{1}_{\{|x| \le 1/2\}}$ is the rectangular window. Any kernel function having form (19) must be positive definite.
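The tapered estimate is a Hadamard product and can be formed directly. The following sketch compares the rectangular and Bartlett kernels on simulated white noise; the helper names, the bandwidth $B = 10$ and the sample size are illustrative assumptions.

```python
import numpy as np

def sample_autocov(x):
    """gamma_hat_k for k = 0..T-1 with the 1/T normalization of (3)."""
    T = len(x)
    return np.array([x[:T - k] @ x[k:] for k in range(T)]) / T

def rectangular(u):
    return (np.abs(u) <= 1).astype(float)

def bartlett(u):
    return np.maximum(0.0, 1.0 - np.abs(u))

def tapered_cov(x, B, kernel):
    """Hadamard product of the sample autocovariance matrix with W_T = [K((s-t)/B)]."""
    T = len(x)
    lag = np.subtract.outer(np.arange(T), np.arange(T))
    return sample_autocov(x)[np.abs(lag)] * kernel(lag / B)

rng = np.random.default_rng(3)
x = rng.standard_normal(200)
for name, K in (("rectangular", rectangular), ("Bartlett", bartlett)):
    S = tapered_cov(x, B=10, kernel=K)
    print(name, "min eigenvalue:", np.linalg.eigvalsh(S).min())
```

The Bartlett version is nonnegative definite up to rounding error, as the Schur product theorem guarantees; the rectangular (banded) version carries no such guarantee, and its smallest eigenvalue may or may not be negative in a given sample.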

Here we shall show that $\hat\Sigma_{T,B_T}$ is a consistent estimate of $\Sigma_T$ and establish a convergence rate of $\lambda(\hat\Sigma_{T,B_T} - \Sigma_T)$. We first consider the bias. By the Gersgorin theorem [cf. Horn and Johnson (1990), Theorem 6.1.1], we have $\lambda(\mathbb{E}\hat\Sigma_{T,B_T} - \Sigma_T) \le b_T$, where

(20) $b_T = 2 \sum_{k=1}^{B_T} \left[ 1 - K\left( \frac{k}{B_T} \right) \right] |\gamma_k| + \frac{2}{T} \sum_{k=1}^{B_T} k |\gamma_k| + 2 \sum_{k=B_T+1}^{T-1} |\gamma_k|$.

The first term on the right-hand side in (20) is due to the choice of the kernel function, whose order of magnitude is determined by the smoothness of $K(\cdot)$ at zero. In particular, this term vanishes if $K(\cdot)$ is the rectangular kernel. If $1 - K(u) = O(u^2)$ at $u = 0$ and $\gamma_k = O(k^{-\beta})$, $\beta > 1$, then $b_T = O(B_T^{1-\beta})$ if $1 < \beta < 2$, $b_T = O(B_T^{1-\beta} + T^{-1})$ if $2 < \beta < 3$ and $b_T = O(B_T^{-2} + T^{-1})$ if $\beta > 3$. Generally, if $\sum_{k=1}^{\infty} |\gamma_k| < \infty$, then $b_T \to 0$ as $B_T \to \infty$ and $B_T < T$.
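The bias bound can be checked numerically when the true autocovariances are known. The sketch below assumes, for illustration only, AR(1) autocovariances with $\rho = 0.6$ and the rectangular kernel, and uses $\mathbb{E}\hat\gamma_k = (1 - |k|/T)\gamma_k$ for a mean-zero process; the function names are hypothetical.

```python
import numpy as np

def bias_bound(gamma, B, T, kernel):
    """b_T from (20), given true autocovariances gamma[0..T-1]."""
    k_in = np.arange(1, B + 1)
    k_out = np.arange(B + 1, T)
    g = np.abs(gamma)
    return (2 * np.sum((1 - kernel(k_in / B)) * g[k_in])
            + (2 / T) * np.sum(k_in * g[k_in])
            + 2 * np.sum(g[k_out]))

def expected_tapered(gamma, B, T, kernel):
    """Expectation of the tapered estimate: entries K(k/B) (1 - |k|/T) gamma_k."""
    lag = np.abs(np.subtract.outer(np.arange(T), np.arange(T)))
    return kernel(lag / B) * (1 - lag / T) * gamma[lag]

rect = lambda u: (np.abs(u) <= 1).astype(float)
T, B, rho = 100, 8, 0.6
gamma = rho ** np.arange(T) / (1 - rho**2)       # assumed AR(1) autocovariances
lag = np.abs(np.subtract.outer(np.arange(T), np.arange(T)))
Sigma = gamma[lag]

bias = np.max(np.abs(np.linalg.eigvalsh(expected_tapered(gamma, B, T, rect) - Sigma)))
b_T = bias_bound(gamma, B, T, rect)
print("operator norm of the bias:", bias, " bound b_T:", b_T)
```

Varying `B` shows the trade-off inside $b_T$: the tail term shrinks as the bandwidth grows while the $(2/T)\sum k|\gamma_k|$ term grows slowly.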

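It is more challenging to deal with $\lambda(\hat\Sigma_{T,B_T} - \mathbb{E}\hat\Sigma_{T,B_T})$. If $X_t \in \mathcal{L}^p$ for some $2 < p \le 4$ and $\mathbb{E}X_t = 0$, Wu and Pourahmadi (2009) obtained

(21) $\lambda(\hat\Sigma_{T,B_T} - \mathbb{E}\hat\Sigma_{T,B_T}) = O_P\left( \frac{B_T \Theta_p^2}{T^{1-2/p}} \right)$.

The key step of their method is to use the inequality

$\lambda(\hat\Sigma_{T,B_T} - \mathbb{E}\hat\Sigma_{T,B_T}) \le 2 \sum_{k=0}^{B_T} |K(k/B_T)| \, |\hat\gamma_k - \mathbb{E}\hat\gamma_k|$,

which is also obtained by the Gersgorin theorem. It turns out that the above rate can be improved by exploiting the Toeplitz structure of the autocovariance matrix. By Lemma 1,

(22) $\lambda(\hat\Sigma_{T,B_T} - \mathbb{E}\hat\Sigma_{T,B_T}) \le 2\pi \max_\theta |\hat f_{T,B_T}(\theta) - \mathbb{E}\hat f_{T,B_T}(\theta)|$.

Since $\hat f_{T,B_T}(\theta)$ is a trigonometric polynomial of order $B_T$, we can bound its maximum by the maximum over a fine grid. The following lemma is adapted from Zygmund (2002), Theorem 7.28, Chapter X.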
Lemma 3. Let $S(x) = \frac{1}{2} a_0 + \sum_{k=1}^{n} [a_k \cos(kx) + b_k \sin(kx)]$ be a trigonometric polynomial of order $n$. For any $x^* \in \mathbb{R}$, $\delta > 0$ and $l \ge 2(1 + \delta) n$, let $x_j = x^* + 2\pi j / l$ for $0 \le j \le l$; then

$\max_x |S(x)| \le (1 + \delta^{-1}) \max_{0 \le j \le l} |S(x_j)|$.

For $\delta > 0$, let $\theta_j = \pi j / \lceil (1 + \delta) B_T \rceil$ for $0 \le j \le \lceil (1 + \delta) B_T \rceil$; then by Lemma 3,

(23) $\max_\theta |\hat f_{T,B_T}(\theta) - \mathbb{E}\hat f_{T,B_T}(\theta)| \le (1 + \delta^{-1}) \max_j |\hat f_{T,B_T}(\theta_j) - \mathbb{E}\hat f_{T,B_T}(\theta_j)|$.
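Lemma 3 can be illustrated on a randomly generated trigonometric polynomial. In this sketch the order $n = 12$, the offset $x^*$, the choice $\delta = 1$ and the Gaussian coefficients are arbitrary assumptions, and the maximum over the whole line is approximated by a dense grid, which can only underestimate it.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 12                                   # order of the trigonometric polynomial
a = rng.standard_normal(n + 1)
b = rng.standard_normal(n + 1)

def S(x):
    """S(x) = a_0 / 2 + sum_{k=1}^n [a_k cos(kx) + b_k sin(kx)], vectorized in x."""
    k = np.arange(1, n + 1)
    x = np.atleast_1d(x)[:, None]
    return 0.5 * a[0] + np.sum(a[1:] * np.cos(k * x) + b[1:] * np.sin(k * x), axis=1)

delta = 1.0
l = int(np.ceil(2 * (1 + delta) * n))    # grid size, l >= 2 (1 + delta) n
x_star = 0.123                           # arbitrary offset
coarse = np.max(np.abs(S(x_star + 2 * np.pi * np.arange(l + 1) / l)))
dense = np.max(np.abs(S(np.linspace(0, 2 * np.pi, 20001))))

print("coarse-grid max:", coarse)
print("dense-grid max :", dense, " bound:", (1 + 1 / delta) * coarse)
```

With $\delta = 1$ a grid of about $4n$ points already controls the global maximum up to a factor 2, which is the form in which the lemma is used in the proof of Theorem 4.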



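Theorem 4. Assume $X_t \in \mathcal{L}^p$ with some $p > 4$, $\mathbb{E}X_t = 0$, and $\Theta_p(m) = O(m^{-\alpha})$. Choose the banding parameter $B_T$ to satisfy $B_T \to \infty$, and $B_T = O(T^\gamma)$, for some

(24) $0 < \gamma < 1$, $\gamma < \alpha p / 2$ and $(1 - 2\alpha)\gamma < (p - 4)/p$.

Then for $b_T$ defined in (20), and $c_p = (p + 4) e^{p/4} \Theta_4^2$,

(25) $\lim_{T \to \infty} P\left[ \lambda(\hat\Sigma_{T,B_T} - \Sigma_T) \le 12 c_p \sqrt{\frac{B_T \log B_T}{T}} + b_T \right] = 1$.

In particular, if $K(x) = \mathbf{1}_{\{|x| \le 1\}}$ and $B_T \asymp (T / \log T)^{1/(2\alpha + 1)}$, then

(26) $\lambda(\hat\Sigma_{T,B_T} - \Sigma_T) = O_P\left[ \left( \frac{\log T}{T} \right)^{\alpha/(2\alpha + 1)} \right]$.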
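The bandwidth in (26) balances the stochastic term $\sqrt{B_T \log B_T / T}$ against a bias of order $B_T^{-\alpha}$. The short sketch below makes this balance explicit; it takes the proportionality constant in $B_T \asymp (T/\log T)^{1/(2\alpha+1)}$ to be 1 and replaces $\log B_T$ by $\log T$ (the two are of the same order), both of which are simplifying assumptions. The function names are hypothetical.

```python
import math

def bandwidth(T, alpha):
    """B_T of order (T / log T)^{1/(2 alpha + 1)}, constant factor taken as 1."""
    return (T / math.log(T)) ** (1.0 / (2 * alpha + 1))

def rate(T, alpha):
    """Convergence rate (log T / T)^{alpha / (2 alpha + 1)} from (26)."""
    return (math.log(T) / T) ** (alpha / (2 * alpha + 1))

for T in (10**3, 10**4, 10**5):
    for alpha in (0.5, 1.0, 2.0):
        B = bandwidth(T, alpha)
        stochastic = math.sqrt(B * math.log(T) / T)   # log B_T replaced by log T
        bias = B ** (-alpha)
        print(T, alpha, round(B, 2), round(stochastic, 4), round(bias, 4), round(rate(T, alpha), 4))
```

At this bandwidth the two terms coincide with the rate in (26); a smaller bandwidth makes the bias dominate and a larger one makes the stochastic term dominate.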

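Proof. In view of (20), to prove (25) we only need to show that

(27) $\lim_{T \to \infty} P\left[ \lambda(\hat\Sigma_{T,B_T} - \mathbb{E}\hat\Sigma_{T,B_T}) \le 12 c_p \sqrt{\frac{B_T \log B_T}{T}} \right] = 1$.

By (22) and (23) where we take $\delta = 1$, the problem is reduced to

(28) $\lim_{T \to \infty} P\left[ (2\pi) \cdot \max_j |\hat f_{T,B_T}(\theta_j) - \mathbb{E}\hat f_{T,B_T}(\theta_j)| \le 6 c_p \sqrt{\frac{B_T \log B_T}{T}} \right] = 1$.

By Theorem 10 (where we take $M = 2$), for any $\gamma < \beta < 1$, there exists a constant $C_{p,\beta}$ such that

(29) $\max_j P\left[ (2\pi) \cdot |\hat f_{T,B_T}(\theta_j) - \mathbb{E}\hat f_{T,B_T}(\theta_j)| \ge 6 c_p \sqrt{\frac{B_T \log B_T}{T}} \right] \le C_{p,\beta} (T B_T)^{-p/4} (\log T) \left[ (T B_T)^{p/4} T^{-\alpha\beta p/2} + T B_T^{p/2 - 1 - \alpha\beta p/2} + T \right] + C_{p,\beta} B_T^{-2}$.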
If (24) holds, there exists a $\gamma < \beta < 1$ such that $\gamma - \alpha\beta p/2 < 0$ and $(p/4 - \alpha\beta p/2)\gamma - (p/4 - 1) < 0$. It follows that by (29),

$P\left[ (2\pi) \cdot \max_j |\hat f_{T,B_T}(\theta_j) - \mathbb{E}\hat f_{T,B_T}(\theta_j)| \ge 6 c_p \sqrt{\frac{B_T \log B_T}{T}} \right] \le C_{p,\beta} (\log T) \left[ T^{\gamma - \alpha\beta p/2} + T^{1 - p/4} + T^{(p/4 - \alpha\beta p/2)\gamma - (p/4 - 1)} \right] + C_{p,\beta} B_T^{-1} = o(1)$.

Therefore, (28) holds and the proof of (25) is complete. The last statement (26) is an immediate consequence. Details are omitted. □

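Remark 2. In practice, $\mathbb{E}X_1$ is usually unknown, and we estimate it by $\bar X_T = T^{-1} \sum_{t=1}^{T} X_t$. Let $\hat\gamma_k^c = T^{-1} \sum_{t=k+1}^{T} (X_{t-k} - \bar X_T)(X_t - \bar X_T)$, and let $\hat\Sigma_{T,B_T}^c$ be defined as $\hat\Sigma_{T,B_T}$ by replacing $\hat\gamma_k$ therein by $\hat\gamma_k^c$. Since $\bar X_T - \mathbb{E}X_1 = O_P(T^{-1/2})$, it is easily seen that $\lambda(\hat\Sigma_{T,B_T} - \hat\Sigma_{T,B_T}^c) = O_P(B_T/T)$. Therefore, the results of Theorem 4 hold for $\hat\Sigma_{T,B_T}^c$ as well.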
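The mean-corrected autocovariances of Remark 2 differ from (3) only by the centering step. A minimal sketch follows, with the hypothetical function name `centered_autocov`; the simulated data and the shift by 5 are illustrative.

```python
import numpy as np

def centered_autocov(x):
    """T^{-1} sum_{t=k+1}^T (x_{t-k} - xbar)(x_t - xbar) for k = 0..T-1."""
    T = len(x)
    d = x - x.mean()
    return np.array([d[:T - k] @ d[k:] for k in range(T)]) / T

rng = np.random.default_rng(5)
x = rng.standard_normal(300)
g0 = centered_autocov(x)
g1 = centered_autocov(x + 5.0)          # same series with an unknown mean of 5 added
print("max change after shifting the mean:", np.max(np.abs(g0 - g1)))
```

The centered version is invariant to the unknown mean, so it can be substituted into the banded or tapered estimate without further changes.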

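Remark 3. In the proof of Theorem 4, we have shown that, as an intermediate step from (28) to (27),

(30) $\lim_{T \to \infty} P\left[ \max_{0 \le \theta \le 2\pi} |\hat f_{T,B_T}(\theta) - \mathbb{E}\hat f_{T,B_T}(\theta)| \le 6 \pi^{-1} c_p \sqrt{T^{-1} B_T \log B_T} \right] = 1$.

The above uniform convergence result is very useful in spectral analysis of time series. Shao and Wu (2007) obtained the weaker version

$\max_{0 \le \theta \le 2\pi} |\hat f_{T,B_T}(\theta) - \mathbb{E}\hat f_{T,B_T}(\theta)| = O_P\left( \sqrt{T^{-1} B_T \log B_T} \right)$

under a stronger assumption that $\Theta_p(m) = O(\rho^m)$ for some $0 < \rho < 1$.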

Remark 4. For linear processes, Woodroofe and Van Ness (1967) derived the asymptotic distribution of the maximum deviations of spectral density estimates. Liu and Wu (2010) generalized their result to nonlinear processes and showed that the limiting distribution of

$\max_{0 \le j \le B_T} \sqrt{\frac{T}{B_T}} \, \frac{|\hat f_{T,B_T}(\pi j / B_T) - \mathbb{E}\hat f_{T,B_T}(\pi j / B_T)|}{f(\pi j / B_T)}$

is Gumbel after suitable centering and rescaling, under stronger conditions than (24). With their result, and using similar arguments as Theorem 2, we can show that for some constant $C_p$,

$\lim_{T \to \infty} P\left[ \lambda(\hat\Sigma_{T,B_T} - \mathbb{E}\hat\Sigma_{T,B_T}) \ge C_p \sqrt{\frac{B_T \log B_T}{T}} \right] = 1$,

which means that the convergence rate we have obtained in (27) is optimal.
Remark 5. The convergence rate $\sqrt{T^{-1} B_T \log B_T} + b_T$ in Theorem 4 is optimal. Consider a process $(X_t)$ which satisfies $\gamma_0 = 3$ and when $k > 0$,

$\gamma_k = A^{-\alpha j}$ if $k = A^j$ for some $j \in \mathbb{N}$, and $\gamma_k = 0$ otherwise,

where $\alpha > 0$ and $A > 0$ is an even integer such that $A^{-\alpha} \le 1/5$. Consider the banded estimate $\hat\Sigma_{T,B_T}$ with the rectangular kernel. As shown in the supplementary article [Xiao and Wu (2012)], there exists a constant $C > 0$ such that

(31) $\lim_{T \to \infty} P\left[ \lambda(\hat\Sigma_{T,B_T} - \Sigma_T) \ge C \sqrt{\frac{B_T \log B_T}{T}} + b_T/5 \right] = 1$,

suggesting that the convergence rate given in (25) of Theorem 4 is optimal. This optimality property can have many applications. For example, it can allow one to derive convergence rates for estimates of $a_T$ in the Wiener-Hopf equation, and the optimal weights $c_T = (c_1, \ldots, c_T)^\top$ in the best linear unbiased estimation problem mentioned in the Introduction.

Remark 6. We now compare (21) and our result. For $p = 4$, (21) gives the order $\lambda(\hat\Sigma_{T,B_T} - \mathbb{E}\hat\Sigma_{T,B_T}) = O_P(B_T/\sqrt{T})$. Our result (27) is sharper by moving the bandwidth $B_T$ inside the square root. We pay the costs of a logarithmic factor, a higher order moment condition ($p > 4$), as well as conditions on the decay rate of tail sum of physical dependence measures (24). Note that when $\alpha \ge 1/2$, the last two conditions of (24) hold automatically, so we merely need $0 < \gamma < 1$, allowing a very wide range of $B_T$. In comparison, for (21) to be useful, one requires $B_T = o(T^{1-2/p})$.

Remark 7. The convergence rate (21) of Wu and Pourahmadi (2009) 
parallels the result of Bickel and Levina (2008b) in the context of high- 
dimensional multivariate analysis, which was improved in Cai, Zhang and 
Zhou (2010) by constructing a class of tapered estimates. Our result parallels 
the optimal minimax rate derived in Cai, Zhang and Zhou (2010), though 
the settings are different. 

Remark 8. Theorem 4 uses the operator norm. For the Frobenius norm see Xiao and Wu (2011) where a central limit theory for $\sum_{k=1}^{B_T} \hat\gamma_k^2$ and $\sum_{k=1}^{B_T} (\hat\gamma_k - \mathbb{E}\hat\gamma_k)^2$ is established.

4. Thresholded covariance matrix estimators. In the context of time se- 
ries, the observations have an intrinsic temporal order and we expect that 
observations are weakly dependent if they are far apart, so banding seems to 
be natural. However, if there are many zeros or very weak correlations within 
the band, the banding method does not automatically generate a sparse es- 
timate. 
The rationale behind the banding operation is sparsity, namely autocovariances with large lags are small, so it is reasonable to estimate them as zero. Applying the same idea to the sample covariance matrix, we can obtain an estimate of $\Sigma_T$ by replacing small entries in $\hat\Sigma_T$ with zero. This regularization approach, termed hard thresholding, was originally developed in nonparametric function estimation. It has recently been applied by Bickel and Levina (2008a) and El Karoui (2008) as a method of covariance regularization in the context of high-dimensional multivariate analysis. Since they do not assume any order of the observations, their sparsity assumptions are permutation-invariant. Unlike their setup, we still use $\Theta_p(m)$ [cf. (7)] and

(32) $\Psi_p(m) = \left( \sum_{t=m}^{\infty} \delta_p(t)^{p'} \right)^{1/p'}$, $\Delta_p(m) = \sum_{t=0}^{\infty} \min\{ C_p \Psi_p(m), \delta_p(t) \}$

as our weak dependence conditions, where $p' = \min(2, p)$ and $C_p$ is given in (38). This is natural for time series analysis.

Let $A_T = 2 c_p' \sqrt{\log T / T}$, where $c_p'$ is the constant given in Lemma 6. The thresholded sample autocovariance matrix is defined as

$\hat\Gamma_{T,A_T} = (\hat\gamma_{s-t} \mathbf{1}_{|\hat\gamma_{s-t}| \ge A_T})_{1 \le s,t \le T}$

with the convention that the diagonal elements are never thresholded. We emphasize that the thresholded estimate may not be positive definite. The following result says that the thresholded estimate is also consistent in terms of the operator norm, and provides a convergence rate which parallels the banding approach in Section 3. In the proof we compare the thresholded estimate $\hat\Gamma_{T,A_T}$ with the banded one $\hat\Sigma_{T,B_T}$ for some suitably chosen $B_T$. This is merely a way to simplify the arguments. The same results can be proved without referring to the banded estimates.
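Hard thresholding of the sample autocovariances takes only a few lines. In the sketch below the threshold is set to $\sqrt{\log T / T}$ times a constant chosen arbitrarily for illustration; it is not the constant of Lemma 6, which depends on unknown moment and dependence quantities. The function name `thresholded_cov` is hypothetical.

```python
import numpy as np

def thresholded_cov(x, A):
    """Keep gamma_hat_k only if |gamma_hat_k| >= A; lag 0 is never thresholded.
    Returns the T x T matrix and the thresholded autocovariance sequence."""
    T = len(x)
    g = np.array([x[:T - k] @ x[k:] for k in range(T)]) / T
    keep = np.abs(g) >= A
    keep[0] = True                          # diagonal elements are always kept
    g_thr = np.where(keep, g, 0.0)
    lag = np.abs(np.subtract.outer(np.arange(T), np.arange(T)))
    return g_thr[lag], g_thr

rng = np.random.default_rng(6)
T = 300
x = rng.standard_normal(T)
A = 2.0 * np.sqrt(np.log(T) / T)            # constant 2.0 is an arbitrary illustrative choice
Gamma_hat, g_thr = thresholded_cov(x, A)
print("threshold:", A, " nonzero lags kept:", int(np.count_nonzero(g_thr)))
```

Unlike banding, the retained lags need not be contiguous, which is what allows the estimate to adapt to zeros inside the band.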

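Theorem 5. Assume $X_t \in \mathcal{L}^p$ with some $p > 4$, $\mathbb{E}X_t = 0$, and $\Theta_p(m) = O(m^{-\alpha})$, $\Delta_p(m) = O(m^{-\alpha'})$ for some $\alpha \ge \alpha' > 0$. If

(33) $\alpha > 1/2$ or $\alpha' p > 2$,

then

$\lambda(\hat\Gamma_{T,A_T} - \Sigma_T) = O_P\left[ \left( \frac{\log T}{T} \right)^{\alpha/(2(1+\alpha))} \right]$.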

Remark 9. Condition (33) is only required for Lemma 6. As commented by Xiao and Wu (2011), it can be reduced to $\alpha p > 2$ for linear processes. See Remark 2 of their paper for more details.

The key step for proving Theorem 5 is to establish a convergence rate for 
the maximum deviation of sample autocovariances. The following lemma 
is adapted from Theorem 3 of Xiao and Wu (2011), where the asymptotic 
distribution of the maximum deviation was also studied. 
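Lemma 6. Assume the conditions of Theorem 5. Then

$\lim_{T \to \infty} P\left( \max_{1 \le k < T} |\hat\gamma_k - \mathbb{E}\hat\gamma_k| \le c_p' \sqrt{\frac{\log T}{T}} \right) = 1$,

where $c_p' = 6(p + 4) e^{p/4} \|X_0\|_4 \Theta_4$.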

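Proof of Theorem 5. Let $B_T = \lfloor (T / \log T)^{1/(2(1+\alpha))} \rfloor$, and $\hat\Sigma_{T,B_T}$ be the banded sample covariance matrix with the rectangular kernel. Recall that $b_T = (2/T) \sum_{k=1}^{B_T} k |\gamma_k| + 2 \sum_{k=B_T+1}^{T-1} |\gamma_k|$ from (20). By Lemma 6, we have

(34) $\lambda(\hat\Sigma_{T,B_T} - \Sigma_T) = O_P\left( B_T \sqrt{\frac{\log T}{T}} + b_T \right)$.

Write the thresholded estimate $\hat\Gamma_{T,A_T} = \hat\Gamma_{T,A_T,1} + \hat\Gamma_{T,A_T,2}$, where

$\hat\Gamma_{T,A_T,1} = (\hat\gamma_{s-t} \mathbf{1}_{|\hat\gamma_{s-t}| \ge A_T, |s-t| \le B_T})_{1 \le s,t \le T}$

and

$\hat\Gamma_{T,A_T,2} = (\hat\gamma_{s-t} \mathbf{1}_{|\hat\gamma_{s-t}| \ge A_T, |s-t| > B_T})_{1 \le s,t \le T}$.

By Gersgorin's theorem, it is easily seen that

(35) $\lambda(\hat\Gamma_{T,A_T,1} - \hat\Sigma_{T,B_T}) \le 2 A_T B_T = O\left( B_T \sqrt{\frac{\log T}{T}} \right)$.

On the other hand,

$\lambda(\hat\Gamma_{T,A_T,2}) \le 2 \left( \sum_{k=B_T+1}^{T-1} |\hat\gamma_k - \mathbb{E}\hat\gamma_k| \mathbf{1}_{|\gamma_k| < A_T/2, |\hat\gamma_k| \ge A_T} + \sum_{k=B_T+1}^{T-1} |\hat\gamma_k - \mathbb{E}\hat\gamma_k| \mathbf{1}_{|\gamma_k| \ge A_T/2, |\hat\gamma_k| \ge A_T} + \sum_{k=B_T+1}^{T-1} |\mathbb{E}\hat\gamma_k| \right) =: 2(I_T + II_T + III_T)$.

The term $III_T$ is dominated by $b_T$. By Lemma 6, we know

(36) $\lim_{T \to \infty} P(I_T > 0) \le \lim_{T \to \infty} P\left( \max_{1 \le k < T} |\hat\gamma_k - \mathbb{E}\hat\gamma_k| > A_T/2 \right) = 0$.

For the remaining term $II_T$, note that the number of $\gamma_k$ such that $k > B_T$ and $|\gamma_k| \ge A_T/2$ is bounded by $2 \sum_{k=B_T+1}^{T-1} |\gamma_k| / A_T$; therefore by Lemma 6

(37) $II_T \le C (B_T^{-\alpha} / A_T) \cdot \max_{1 \le k < T} |\hat\gamma_k - \mathbb{E}\hat\gamma_k| = O_P(B_T^{-\alpha})$.

Putting (34), (35), (36) and (37) together, the proof is complete. □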
Remark 10. If the mean $\mathbb{E}X_1$ is unknown, we need to replace $\hat\gamma_k$ by $\hat\gamma_k^c$ (Remark 2). Since Lemma 6 still holds when $\hat\gamma_k$ are replaced by $\hat\gamma_k^c$ [Xiao and Wu (2011)], Theorem 5 remains true for $\hat\gamma_k^c$.

5. A simulation study. The thresholded estimate is desirable in that it 
can lead to a better estimate when there are a lot of zeros or very weak 
autocovariances. Unfortunately, due to technical difficulties, the theoretical 
result (cf. Theorem 5) does not reflect this advantage. We show by simula- 
tions that thresholding does have a better finite sample performance over 
banding when the true autocovariance matrix is sparse. 

Consider two linear processes $X_t = \sum_{s=0}^{\infty} a_s\varepsilon_{t-s}$ and
$Y_t = \sum_{s=0}^{\infty} b_s\varepsilon_{t-s}$, where $a_0 = b_0 = 1$, and when $s > 0$,

$a_s = cs^{-(1+\alpha)}$, $b_s = c(s/2)^{-(1+\alpha)}\mathbf{1}_{s\text{ is even}}$

for some $c > 0$ and $\alpha > 0$; and the $\varepsilon_s$'s are taken as i.i.d. $N(0, 1)$.
Let $\gamma_k^X$, $\Sigma_T^X$ and $\gamma_k^Y$, $\Sigma_T^Y$ denote the autocovariances and
autocovariance matrices of the two processes, respectively. It is easily seen
that $\gamma_k^Y = 0$ if $k$ is odd. In fact, $(Y_t)$ can be obtained by interlacing two
i.i.d. copies of $(X_t)$. For a given set of centered observations
$X_1, X_2, \ldots, X_T$, assuming that its true autocovariance matrix is known,
for a fair comparison we choose the optimal bandwidth $B_T^X$ and threshold
$A_T^X$ as

$B_T^X = \operatorname{argmin}_B \lambda(\hat\Sigma_{T,B}^X - \Sigma_T^X)$, $A_T^X = \operatorname{argmin}_A \lambda(\hat\Gamma_{T,A}^X - \Sigma_T^X)$.

The two parameters for the $(Y_t)$ process are chosen in the same way. In
all the simulations we set $c = 0.5$. For different combinations of the sample
size $T$ and the parameter $\alpha$, which controls the decay rate of autocovariances,
we report the average distances in terms of the operator norm of the two
estimates $\hat\Sigma_{T,B_T}$ and $\hat\Gamma_{T,A_T}$ from $\Sigma_T$, as well as the standard errors
based on 1000 repetitions. We also give the average bandwidth of $\hat\Sigma_{T,B_T}$.
Instead of reporting the average threshold for $\hat\Gamma_{T,A_T}$, we provide the
average number of nonzero autocovariances appearing in the estimates, which
is comparable to the average bandwidth of $\hat\Sigma_{T,B_T}$.
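
The simulation design described above can be sketched in a few lines. This is only an illustration under stated assumptions: the coefficients are read as $a_s = cs^{-(1+\alpha)}$, the MA($\infty$) sums are truncated at a hypothetical `n_lags`, all function names are invented for the sketch, and the demo uses far fewer repetitions than the 1000 reported in the paper.

```python
import numpy as np

def ma_coefficients(alpha, c=0.5, n_lags=2000, interlaced=False):
    """a_0 = 1, a_s = c * s**-(1 + alpha) for s >= 1, truncated at n_lags.

    With interlaced=True, a_j is placed at lag 2j and odd lags are zero,
    giving the coefficients b_s of the interlaced process (Y_t).
    """
    s = np.arange(1, n_lags + 1)
    a = np.concatenate(([1.0], c * s ** -(1.0 + alpha)))
    if not interlaced:
        return a
    b = np.zeros(2 * n_lags + 1)
    b[::2] = a
    return b

def true_autocov(coef, T):
    """gamma_k = sum_s coef_s * coef_{s+k} for k = 0, ..., T-1."""
    return np.array([coef[: len(coef) - k] @ coef[k:] for k in range(T)])

def simulate(coef, T, rng):
    """T observations of the (truncated) linear process with N(0,1) noise."""
    eps = rng.standard_normal(T + len(coef) - 1)
    return np.convolve(eps, coef, mode="valid")

def sample_autocov(x):
    """Sample autocovariances of centered data, lags 0, ..., T-1."""
    T = len(x)
    return np.array([x[: T - k] @ x[k:] for k in range(T)]) / T

def toeplitz_from(g):
    idx = np.arange(len(g))
    return g[np.abs(idx[:, None] - idx[None, :])]

def oracle_errors(x, gamma):
    """Operator-norm errors: best banded, best thresholded, raw sample."""
    g_hat = sample_autocov(x)
    sigma = toeplitz_from(gamma)
    err = lambda g: np.linalg.norm(toeplitz_from(g) - sigma, 2)
    lags = np.arange(len(x))
    band = min(err(np.where(lags <= B, g_hat, 0.0)) for B in lags)
    cuts = np.append(np.abs(g_hat), np.inf)   # every distinct threshold
    thr = min(err(np.where(np.abs(g_hat) >= A, g_hat, 0.0)) for A in cuts)
    return band, thr, err(g_hat)

rng = np.random.default_rng(0)
T, alpha = 64, 0.5
for name, interlaced in (("X", False), ("Y", True)):
    coef = ma_coefficients(alpha, interlaced=interlaced)
    gamma = true_autocov(coef, T)
    errs = [oracle_errors(simulate(coef, T, rng), gamma) for _ in range(5)]
    print(name, np.mean(errs, axis=0))   # banded, thresholded, sample
```

Searching thresholds only over the observed values $|\hat\gamma_k|$ (plus infinity) is enough, because the thresholded estimate changes only when the threshold crosses one of them.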

Table 1

Errors under operator norm for $(X_t)$

                    T = 100                      T = 200                      T = 500
α     Estimate      Error          BW            Error          BW            Error          BW
0.2   Banded        2.94 (1.17)    9.55 (6.60)   3.01 (1.22)    13.4 (7.67)   2.96 (1.23)    23.4 (13.1)
      Thresholded   3.66 (1.07)    5.40 (4.87)   3.88 (1.14)    7.39 (5.81)   4.08 (1.17)    12.5 (10.1)
      Sample        6.98 (2.63)                  8.12 (2.85)                  10.57 (3.93)
0.5   Banded        1.52 (0.68)    6.31 (4.58)   1.38 (0.60)    8.46 (5.57)   1.15 (0.50)    11.9 (7.67)
      Thresholded   1.90 (0.64)    3.49 (2.56)   1.89 (0.59)    4.15 (3.07)   1.74 (0.54)    5.15 (3.27)
      Sample        5.55 (2.37)                  6.73 (2.91)                  8.88 (3.28)
1     Banded        0.82 (0.39)    4.04 (2.33)   0.69 (0.32)    4.62 (2.47)   0.52 (0.24)    5.68 (3.06)
      Thresholded   1.03 (0.38)    2.24 (0.87)   0.95 (0.32)    2.29 (0.74)   0.81 (0.29)    2.58 (0.83)
      Sample        4.80 (2.14)                  6.05 (2.25)                  7.81 (2.64)

"Error" refers to the average distance between the estimates and the true autocovariance
matrix under the operator norm, and "BW" refers to the average bandwidth of the banded
estimates, and the average number of nonzero sub-diagonals (including the diagonal) for
the thresholded ones. The numbers 0.2, 0.5 and 1 in the first column are values of $\alpha$.
For each combination of $T$ and $\alpha$, three lines are reported, corresponding to banded
estimates, thresholded ones and sample autocovariance matrices, respectively. Numbers
in parentheses are standard errors.

From Table 1, we see that for the process $(X_t)$, the banding method
outperforms the thresholding one, but the latter does give sparser estimates.
For the process $(Y_t)$, according to Table 2, we find that thresholding
performs better than banding when the sample size is not very large
($T = 100, 200$), and yields sparser estimates as well. The advantage of
thresholding in error disappears when the sample size is 500. Intuitively
speaking, banding is a way to threshold according to the truth
(autocovariances with large lags are small), while thresholding is a way to
threshold according to the data. Therefore, if the autocovariances are
nonincreasing as for the process $(X_t)$, or if the sample size is large
enough, banding is preferable. If the autocovariances do not vary regularly
as for the process $(Y_t)$ and the sample size is moderate, thresholding is
more adaptive. As a combination, in practice we can use a
thresholding-after-banding estimate which enjoys both advantages.
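
One possible reading of the thresholding-after-banding idea is sketched below: a sample autocovariance is kept only if its lag is within the bandwidth and its magnitude reaches the threshold. The paper does not spell out the rule, so the function name, the inclusive comparisons and the toy numbers are all assumptions made for illustration.

```python
import numpy as np

def threshold_after_banding(g_hat, B, A):
    """Keep g_hat[k] only if k <= B (banding) and |g_hat[k]| >= A (thresholding)."""
    g = np.asarray(g_hat, dtype=float)
    lags = np.arange(len(g))
    keep = (lags <= B) & (np.abs(g) >= A)
    return np.where(keep, g, 0.0)

# Hypothetical sample autocovariances at lags 0, ..., 5.
g_hat = [1.30, 0.05, 0.42, -0.03, 0.20, 0.15]
print(threshold_after_banding(g_hat, B=4, A=0.1))
```

In this toy input the lag-5 value is removed by the band even though it exceeds the threshold, and the small values at lags 1 and 3 are removed by the threshold even though they lie inside the band.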

Table 2

Error under operator norm for $(Y_t)$

                    T = 100                      T = 200                      T = 500
α     Estimate      Error          BW            Error          BW            Error          BW
0.2   Banded        3.33 (0.86)    9.87 (6.89)   3.54 (0.95)    13.7 (7.67)   3.61 (1.07)    24.7 (13.1)
      Thresholded   3.15 (0.93)    3.95 (3.50)   3.43 (1.00)    5.69 (4.72)   3.75 (1.08)    9.23 (8.04)
      Sample        7.21 (4.28)                  8.69 (4.79)                  11.1 (5.31)
0.5   Banded        1.98 (0.61)    7.26 (5.32)   1.88 (0.59)    9.95 (6.44)   1.63 (0.53)    16.3 (10.1)
      Thresholded   1.81 (0.60)    2.93 (2.41)   1.81 (0.59)    3.44 (2.22)   1.71 (0.54)    4.64 (2.97)
      Sample        5.88 (3.27)                  7.25 (3.59)                  9.25 (3.72)
1     Banded        1.19 (0.41)    5.31 (3.33)   1.01 (0.35)    6.20 (3.58)   0.79 (0.28)    8.28 (4.95)
      Thresholded   1.02 (0.39)    2.16 (0.65)   0.92 (0.32)    2.21 (0.57)   0.80 (0.28)    2.52 (0.77)
      Sample        5.09 (2.77)                  6.39 (2.79)                  8.18 (2.91)

Admittedly our simulation is a very limited one, because we assume
that the true autocovariance matrices are known. Practitioners would need
a method to choose the bandwidth and/or threshold from the data. Although
theoretical results suggest convergence rates of banding and thresholding
parameters which lead to optimal convergence rates of the estimates, they
do not offer much help for finite samples. The issue was addressed by Wu
and Pourahmadi (2009), incorporating the idea of risk minimization from
Bickel and Levina (2008b) and the technique of subsampling from Politis,
Romano and Wolf (1999), and by McMurry and Politis (2010), using the rule
introduced in Politis (2003) for selecting the bandwidth in spectral density
estimation. An alternative method is to use the block length selection
procedure in Bühlmann and Künsch (1999), which is designed for spectral
density estimation. We shall study other data-driven methods in the future.

6. Moment inequalities. This section presents some moment inequalities
that will be useful in the subsequent proofs. In Lemma 7, the case $1 < p \le 2$
follows from Burkholder (1988) and the other case $p > 2$ is due to Rio (2009).
Lemma 8 is adapted from Proposition 1 of Xiao and Wu (2011).

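Lemma 7 [Burkholder (1988), Rio (2009)]. Let $p > 1$ and $p' = \min\{p, 2\}$;
let $D_t$, $1 \le t \le T$, be martingale differences, and $D_t \in \mathcal{L}^p$ for every $t$. Write
$M_T = \sum_{t=1}^{T} D_t$. Then

(38) $\|M_T\|_p^{p'} \le \mathcal{C}_p^{p'}\sum_{t=1}^{T}\|D_t\|_p^{p'}$, where $\mathcal{C}_p = (p-1)^{-1}$ if $1 < p \le 2$, and $\mathcal{C}_p = \sqrt{p-1}$ if $p > 2$.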



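It is convenient to use $m$-dependence approximation for processes of
the form (6). For $t \in \mathbb{Z}$, let $\mathcal{F}^t = \langle\varepsilon_t, \varepsilon_{t+1}, \ldots\rangle$ be the $\sigma$-field generated
by the innovations $\varepsilon_t, \varepsilon_{t+1}, \ldots$, and define the projection operators
$\mathcal{H}_t(\cdot) = E(\cdot\,|\,\mathcal{F}^t)$ and $\mathcal{P}^t(\cdot) = \mathcal{H}_t(\cdot) - \mathcal{H}_{t+1}(\cdot)$. Observe that
$(\mathcal{P}^{-t}(\cdot))_{t\in\mathbb{Z}}$ is a martingale difference sequence with respect to the
filtration $(\mathcal{F}^{-t})$. For $m \ge 0$, define $\tilde X_t = \mathcal{H}_{t-m}X_t$; then
$\|X_t - \tilde X_t\|_p \le \mathcal{C}_p\Psi_p(m+1)$ [see Proposition 1 of Xiao and Wu (2011) for a
proof], and $(\tilde X_t)_{t\in\mathbb{Z}}$ is an $(m+1)$-dependent sequence.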
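
For a linear process the $m$-dependence approximation is explicit: conditioning on the last $m+1$ innovations simply truncates the moving-average sum at lag $m$. The sketch below checks the approximation error against $\Psi_2(m+1)$ in that special case. It assumes unit-variance Gaussian innovations, the usual definition $\Psi_2(m) = (\sum_{t\ge m}\delta_2(t)^2)^{1/2}$ with $\delta_2(t) = \sqrt{2}\,|a_t|$, and illustrative coefficients $a_s = cs^{-(1+\alpha)}$; the function names are hypothetical.

```python
import numpy as np

def linear_coefficients(alpha, c=0.5, n_lags=5000):
    """a_0 = 1, a_s = c * s**-(1 + alpha), truncated at n_lags (an assumption)."""
    s = np.arange(1, n_lags + 1)
    return np.concatenate(([1.0], c * s ** -(1.0 + alpha)))

def approximation_error_l2(a, m):
    """||X_t - X~_t||_2 = sqrt(sum_{s > m} a_s^2) for unit-variance innovations."""
    return np.sqrt(np.sum(a[m + 1:] ** 2))

def psi_2(a, m):
    """Psi_2(m) = sqrt(sum_{t >= m} delta_2(t)^2) with delta_2(t) = sqrt(2)|a_t|."""
    return np.sqrt(2.0 * np.sum(a[m:] ** 2))

a = linear_coefficients(alpha=0.5)
for m in (1, 5, 25, 125):
    # For p = 2 the constant in the bound equals 1, so the second column
    # should never be smaller than the first.
    print(m, approximation_error_l2(a, m), psi_2(a, m + 1))
```

The truncated process uses only the innovations at lags 0 through $m$, so two of its values more than $m$ time steps apart share no innovations, which is the $(m+1)$-dependence stated in the text.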

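Lemma 8. Assume $EX_t = 0$ and $p > 1$. For $m \ge 0$, define $\tilde X_t = \mathcal{H}_{t-m}X_t =
E(X_t\,|\,\mathcal{F}^{t-m})$. Let $\tilde\delta_p(\cdot)$ be the physical dependence measure for the sequence $(\tilde X_t)$.
Then

(39) $\|\mathcal{P}^0X_t\|_p \le \delta_p(t)$ and $\tilde\delta_p(t) \le \delta_p(t)$,

(40) $|\gamma_k| \le \zeta_2(k)$ where $\zeta_p(k) := \sum_{j=0}^{\infty}\delta_p(j)\delta_p(j+k)$,

(41) $\bigl\|\sum_{s,t=1}^{T} c_{s,t}(X_sX_t - \gamma_{s-t})\bigr\|_{p/2} \le 4\mathcal{C}_{p/2}\mathcal{C}_p\Theta_p^2\mathcal{B}_T\sqrt{T}$ when $p \ge 4$,

(42) $\bigl\|\sum_{t=1}^{T} c_t(X_t - \tilde X_t)\bigr\|_p \le \mathcal{C}_p\mathcal{A}_T\Theta_p(m+1)$ when $p \ge 2$,

where

$\mathcal{A}_T = \bigl(\sum_{t=1}^{T}|c_t|^2\bigr)^{1/2}$ and $\mathcal{B}_T^2 = \max\bigl\{\max_{1\le t\le T}\sum_{s=1}^{T}c_{s,t}^2,\ \max_{1\le s\le T}\sum_{t=1}^{T}c_{s,t}^2\bigr\}$.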

7. Large deviations for quadratic forms. In this section we prove a
result on probabilities of large deviations of quadratic forms of stationary
processes, which take the form

$Q_T = \sum_{1\le s\le t\le T} a_{s,t}X_sX_t$.

The coefficients $a_{s,t} = a_{T,s,t}$ may depend on $T$, but we suppress $T$ from
subscripts for notational simplicity. Throughout this section we assume that
$\sup_{s,t}|a_{s,t}| \le 1$, and $a_{s,t} = 0$ when $|s - t| > B_T$, where $B_T$ satisfies $B_T \to \infty$,
and $B_T = O(T^{\gamma})$ for some $0 < \gamma < 1$.
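
As a concrete illustration of such banded quadratic forms, the sketch below evaluates $Q_T$ by summing along the diagonals within the band, and shows that the lag-$k$ sample autocovariance is the special case $a_{s,t} = T^{-1}\mathbf{1}\{t - s = k\}$. The function name and the demo values are hypothetical; the code assumes the coefficient array is given as a dense $T \times T$ matrix.

```python
import numpy as np

def quadratic_form(x, a, B):
    """Q_T = sum over s <= t with t - s <= B of a[s, t] * x[s] * x[t]."""
    x = np.asarray(x, dtype=float)
    T = len(x)
    total = 0.0
    for k in range(min(B, T - 1) + 1):
        # np.diagonal(a, offset=k) holds a[s, s + k] for s = 0, ..., T-k-1.
        total += np.sum(np.diagonal(a, offset=k) * x[: T - k] * x[k:])
    return total

rng = np.random.default_rng(0)
T, k = 200, 3
x = rng.standard_normal(T)
a = np.eye(T, k=k) / T              # a[s, t] = 1/T when t - s = k, else 0
gamma_hat_k = x[: T - k] @ x[k:] / T
print(quadratic_form(x, a, B=k), gamma_hat_k)
```

Summing diagonal by diagonal costs on the order of $TB_T$ operations instead of $T^2$, which matters when the band is narrow relative to the sample size.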

Large deviations for quadratic forms of stationary processes have been
extensively studied in the literature. Bryc and Dembo (1997) and Bercu,
Gamboa and Rouault (1997) obtained the large deviation principle [Dembo
and Zeitouni (1998)] for Gaussian processes. Gamboa, Rouault and Zani
(1999) considered the functional large deviation principle. Bercu, Gamboa
and Lavielle (2000) obtained a more accurate expansion of the tail
probabilities. Zani (2002) extended the results of Bercu, Gamboa and Rouault
(1997) to locally stationary Gaussian processes. In fact, our result is more
relevant to the so-called moderate deviations according to the terminology of
Dembo and Zeitouni (1998). Bryc and Dembo (1997) and Kakizawa (2007)
obtained moderate deviation principles for quadratic forms of Gaussian
processes. Djellout, Guillin and Wu (2006) studied moderate deviations of
periodograms of linear processes. Bentkus and Rudzkis (1976) considered the
Cramér-type moderate deviation for spectral density estimates of Gaussian
processes; see also Saulis and Statulevičius (1991). Liu and Shao (2010)
derived the Cramér-type moderate deviation for maxima of periodograms
under the assumption that the process consists of i.i.d. random variables.

For our purpose, on one hand, we do not need a result that is as precise
as the moderate deviation principle or the Cramér-type moderate deviation.
On the other hand, we need an upper bound for the tail probability under
less restrictive conditions. Specifically, we would like to relax the Gaussian,
linear or i.i.d. assumptions which were made in the previous works. Rudzkis
(1978) provided a result in this fashion under the assumption of boundedness
of the cumulant spectra up to a finite order. While this type of assumption
holds under certain mixing conditions, the latter themselves are not easy
to verify in general and many well-known examples are not strong mixing
[Andrews (1984)]. We instead impose conditions through physical
dependence measures, which are easy to use in many applications [Wu
(2005)]. Furthermore, our result can be sharper; see Remark 11.




Our main tool is the $m$-dependence approximation. In the next lemma we
use dependence measures to bound the $\mathcal{L}^{p/2}$ norm of the distance between $Q_T$
and the $m$-dependent version $\tilde Q_T$. The proof and a few remarks on the
optimality of the result are given in the supplementary article [Xiao and Wu
(2012)].

Lemma 9. Assume $X_t \in \mathcal{L}^p$ with $p > 4$, $EX_t = 0$ and $\Theta_p < \infty$. Let $\tilde X_t =
\mathcal{H}_{t-m_T}X_t$ and $\tilde Q_T = \sum_{1\le s\le t\le T} a_{s,t}\tilde X_s\tilde X_t$; then

\\EoQt-^oQt\\p/2 

< 4Qp{mTf + n{p - 2)Qp^/TB^Qp{mT) 

+ (p- 2)yrB^[3ep(LmT/2j)Ap(mT) + ?,Qp{mT)^p{[mT /2\)]. 

The following theorem is the main result of this section. 

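Theorem 10. Assume $X_t \in \mathcal{L}^p$, $p > 4$, $EX_t = 0$, and $\Theta_p(m) = O(m^{-\alpha})$.
Set $c_p = (p+4)e^{p/4}\Theta_4^2$. For any $M > 1$, let $x_T = 2c_p\sqrt{TMB_T\log B_T}$. Assume
that $B_T \to \infty$ and $B_T = O(T^{\gamma})$ for some $0 < \gamma < 1$. Then for any $\gamma < \beta < 1$,
there exists a constant $C_{p,M,\beta} > 0$ such that

$P(|E_0Q_T| > x_T)$
$\quad \le C_{p,M,\beta}\,x_T^{-p/2}(\log T)\bigl[(TB_T)^{p/4}T^{-\alpha\beta p/2} + TB_T^{p/2-1-\alpha\beta p/2} + T\bigr] + C_{p,M,\beta}B_T^{-M}$.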

Remark 11. Rudzkis (1978) proved that if $p = 4k$ for some $k \in \mathbb{N}$, then

$P(|E_0Q_T| > x_T) \le Cx_T^{-p/2}(TB_T)^{p/4}$,

which can be obtained by using the Markov inequality and (41) under our
framework. The upper bound given in Theorem 10 has a smaller order of
magnitude. We note that Rudzkis (1978) also proved a stronger exponential
inequality under strong moment conditions, which required the existence of
every moment and the absolute summability of cumulants of every order.
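
The Markov-inequality argument mentioned in Remark 11 can be spelled out in a few lines. This is a sketch that reads (41) as the bound $\|\sum_{s,t} c_{s,t}(X_sX_t - \gamma_{s-t})\|_{p/2} \le 4\mathcal{C}_{p/2}\mathcal{C}_p\Theta_p^2\mathcal{B}_T\sqrt{T}$, with $\mathcal{B}_T^2$ the largest row or column sum of squared coefficients; that reading and the generic constant $C$ are assumptions of the sketch.

```latex
% Take c_{s,t} = a_{s,t} for s <= t and c_{s,t} = 0 otherwise, so that
% E_0 Q_T = \sum_{s,t} c_{s,t} (X_s X_t - \gamma_{s-t}).
% Each row and column of (c_{s,t}) has at most B_T + 1 nonzero entries,
% each of modulus at most 1, and B_T >= 1.
\begin{align*}
\mathcal{B}_T^2 &\le B_T + 1 \le 2B_T, \\
\|E_0 Q_T\|_{p/2} &\le 4\,\mathcal{C}_{p/2}\,\mathcal{C}_p\,\Theta_p^2\,\sqrt{2TB_T}, \\
P(|E_0 Q_T| > x_T) &\le \frac{\|E_0 Q_T\|_{p/2}^{p/2}}{x_T^{p/2}}
  \le C\,x_T^{-p/2}\,(TB_T)^{p/4}.
\end{align*}
```

Compared with this bound, the leading term in Theorem 10 carries the extra factor $T^{-\alpha\beta p/2}$, which is where the gain from the $m$-dependence approximation shows up.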

Proof of Theorem 10. Without loss of generality, assume $B_T \le T^{\gamma}$.
For $\gamma < \beta < 1$, let $m_T = \lfloor T^{\beta}\rfloor$, $\tilde X_t = \mathcal{H}_{t-m_T}X_t$ and

$\tilde Q_T = \sum_{1\le s\le t\le T} a_{s,t}\tilde X_s\tilde X_t$.

By Lemma 9 and (41), we have

(43) $P\bigl[|E_0(Q_T - \tilde Q_T)| > c_p\sqrt{MTB_T(\log B_T)}\bigr] \le C_{p,M,\beta}\,x_T^{-p/2}(TB_T)^{p/4}T^{-\alpha\beta p/2}$.

Split $[1, T]$ into blocks $\mathcal{B}_1, \ldots, \mathcal{B}_{b_T}$ of size $2m_T$, and define

QT,k = ^ ^ as,tXsXt. 

By Corollary 1.7 of Nagaev (1979) and (41), we know for any M > 1, there 
exists a constant Cp^M,i3 such that 



P[\EoQt\ > Cpy/TMBrilogBT)] 



k=l 



<^p(|EoQT,fc|>^^^) + 

(44) 



Cp.M./jTmj^HmT^T)^/^ 



p.M,p 



+ C^exp<f 



(p + 4)2gp/204 
< J2 PiMT,k\ > XT/Cp^M,l3) + Cp^mABt'' + ^ 



fc=l 

By Lemma 11, we have 

(45) <Cp,Af,/3x7/'(logr) 

X [(T^BT)^^^r~"'^P/^ + 2-v^^P/2-l-"/3p/2 _^ ^v^j^ 

Combining (43), (44) and (45), the proof is complete. □ 

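Lemma 11. Assume $X_t \in \mathcal{L}^p$ with $p > 4$, $EX_t = 0$, and $\Theta_p(m) = O(m^{-\alpha})$.
If $x_T > 0$ satisfies $T^{\delta}\sqrt{TB_T} = o(x_T)$ for some $\delta > 0$, then for any $0 < \beta < 1$,
there exists a constant $C_{p,\delta,\beta}$ such that

$P(|E_0Q_T| > x_T) \le C_{p,\delta,\beta}\,x_T^{-p/2}(\log T)\bigl[(TB_T)^{p/4}T^{-\alpha\beta p/2} + TB_T^{p/2-1-\alpha\beta p/2} + T\bigr]$.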

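Proof. For $j \ge 1$, define $m_{T,j} = \lfloor T^{\beta^j}\rfloor$, $X_{t,j} = \mathcal{H}_{t-m_{T,j}}X_t$ and

$Q_{T,j} = \sum_{1\le s\le t\le T} a_{s,t}X_{s,j}X_{t,j}$.

Let $j_T = \lceil -\log(\log T)/(\log\beta)\rceil$. Note that
$m_{T,j_T} \le e$. By Lemma 9 and (41),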

(46) P[\MQt - Qt,i)\ > xt/Jt] < CpA\ogTfl^x~^'\TBTY/^T~^^^l\ 
Let j'rp be the smallest j such that nriTj < Bt/4:. For 1 < j < j'rp, split [1,T] 
into blocks B\" , . . . ,Bl" , oi size St + w-Tj' • Define 

RT,j,b= ^ ^ a,s,tXs,jXtj and RT,j,b= '^s,t^s,j+i-^t,i+i- 
By Corollary 1.6 of Nagaev (1979) and (41), we have for any $C > 2$
r -r. n 

(47) P 



XT 



\^0iQT,j - QT,j+l)\ >77— 



< 



\Eo{RT,j,b-B^,,,,)\>-^ 

6=1 L 



T 



(48) 



+ 2 



64^62 e|rsTi2 



T 



C/4 



It is clear that for any M > 1, there exists a constant Cm,s,i3 such that the 
term in (48) is less than Cm,s,(3X^^^ ■ For (47), by Lemma 9 and (41) 



b=l 



|Eo(i2T,i,fe-i?T,,-,fe)l> 



XT 
CjT 



< Cp,pT{mT,,r' ■ (logT)V2 . a:-^/^ . {mT^jBTf'^ ■ m^^ 

< C^^hXtFI^ ■ (logT)i/2rS^/^ . {mT,j 



-ap/2 
J+1 



-l-a/3p/2 



Depending on whether the exponent p/4:— 1 — a(3p/2 is positive or not, the 
term [mTj)^^*~^~°'^^^'^ is maximized when j = 1 or j = j't — 1? respectively, 
and we have 



(49) 



6=1 L "'^ 



< Cp^pX-^^'^ ■ (logr)l/2 . [(^^^)p/42.-a/3p/2 ^ j.^^ 

Combining (46), (47), (48) and (49), we have shown that 

P(|EoQt| >xt) 
(50) < P(|EoQr,i' I ^ ^2^/2) + C^^map^t^' 



M,S,l3Xrp 



p/2 



(iogr)[(rsT)^/^r^"'^^/^ + r^- 



p/2-l-a/3p/2, 
T i 



To deal with the probability concerning Qt j' in (50), we split [1,T] into 
blocks Bi, . . . ,Bhj. with size 2i?T, and define the block sums 

Similarly as (47) and (48), there exists a constant Cp^M,s./3 > 2 such that 

Pi\EoQT.f\ > XT 12) < Vp(|Eoi?Tj' .61 > 77^^) + Cp^MAd^T^' ■ 
By Lemma 12, we have

and it follows that for some constant Cp^s,i3 > 0, 

(51) P{\EoQT,f^\ > xt/2) < Cp,s,px~''/\logT)T{Bf-'-'"'^/' + 1). 
The proof is completed by combining (50) and (51). □ 

In the next lemma we consider $Q_T$ when the restriction $a_{s,t} = 0$ for
$|s - t| > B_T$ is removed. To avoid confusion, we use a new symbol. Let

$R(T, m) = \sum_{1\le s\le t\le T} c_{s,t}(\mathcal{H}_{s-m}X_s)(\mathcal{H}_{t-m}X_t)$.

For $x_T > 0$, define

$U(T, m, x_T) = \sup_{\{c_{s,t}\}} P[|E_0R(T, m)| > x_T]$,

where the supremum is taken over all arrays $\{c_{s,t}\}$ such that $|c_{s,t}| \le 1$. We use
$R_T$ and $U(T, x_T)$ as shorthands for $R(T, \infty)$ and $U(T, \infty, x_T)$, respectively.

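Lemma 12. Assume $X_t \in \mathcal{L}^p$ with $p > 4$, $EX_t = 0$, and $\Theta_p(m) = O(m^{-\alpha})$.
If $x_T > 0$ satisfies $T^{1+\delta} = o(x_T)$ for some $\delta > 0$, then for any $0 < \beta < 1$, there
exists a constant $C_{p,\delta,\beta}$ such that

$P(|E_0R_T| > x_T) \le C_{p,\delta,\beta}\,x_T^{-p/2}(\log T)(T^{p/2-\alpha\beta p/2} + T)$.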

Proof. Let tut = [T'^J and Rt := R{T,mT)- By Lemma 9 and (41), 
P[\Eo{Rt - Rt)\ > xt/2] < CpX^^'^^pP/^-^'^P/^. 
We claim that there exists a constant Cp^s,p such that 

(52) U{T,mT,XT/2) < Cp,s,^x-''/\TlogT){mf-'-''^P/^ + 1). 
Therefore, the proof is complete by using 

Pi\EQRT\>XT)<P[\Eo{RT-RT)\>XT/2] + UiT,mT,XT/2). 

We need to prove the claim (52). Let zt satisfy T^~^^ = o{zt)- Let Jt = 
[-log(logT)/(log/3)], and note that T^'^ < e. Set ?/t = ZT/{2jT). We con- 
sider U(T,m, Zt) for an arbitrary 1 <m< T/4. Set Xt^i ■.= 'Ht~mXt and 
Xt,2 ■='Ht-l7n0\Xt- Define 

3m— 1 t 

Yt,i = ^ Cs,tXs,i and Zt^i = ^ Cs,tXs,i 

s=l s=lV(t-3m) 
and $Y_{t,2}$, $Z_{t,2}$ similarly by replacing $X_{s,1}$ with $X_{s,2}$. Observe that $X_{t,k}$ and $Y_{t,l}$
are independent for $k, l = 1, 2$. We first consider $\sum_{t=1}^{T}(X_{t,1}Z_{t,1} - X_{t,2}Z_{t,2})$.
Split [1,T] into blocks Bi, . . . ^Bhj. with size 4m, and define WT,b = 
- ^i,2^t,2). Let 2/T satisfy yr < zt/2 and T^+^/^ = oiyr). 
Since WT,b and WT,b' are independent if |6 — 6'| > 1, by Corollary 1.6 of Na- 
gaev (1979), (41) and Lemma 9, similarly as (47) and (48), we know for any 
M > 1 , there exists a constant Cp^M,5,i3 such that 



P 




XiaZia — XtoZfo 



> yr 



(53) 



<Cr, 



'p,MA0T^ + J^P(|EoWT,f,| > yr/C 



6=1 



Now we deal with the term Ylt=ii-^t,iyt,i — -^t,2^f,2)- Split [l,T] into blocks 
Bl,...,Bl,^ with size m. Define i?T,fe = SteB* " -'^t,2^t,2)- Let be 

the cr-fields generated by {e;j^,e/j^_i, . . .}, where li, = max{S^}. Observe that 
{RT,b)b is odd is a martingale sequence with respect to {Cb)b is odd, and so are 
{RT,b)b is even and {S,b)b is even- By Lemma 1 of Haeusler (1984) we know for 
any M > 1, there exists a constant Cms such that 



P/22^^p/2-l-Q/3p/2^ 



(54) 



p 



t=i 



> VT 



EiE(^T,.l6-2)>7-^^ 

^ (logyr)^/^ 



+ |i?T,fe|> 
6=1 

=: It + IIt + UIt- 



yr 



logyr 



Since {Xt^i,Xt^2) and (lt,i, lt,2) are independent, R^fi has finite pth moment. 
Using similar arguments as Lemma 9, we have 



\RT,b\\p<CpimT)P/^ 



m 



-afip. 



and it follows that 

(55) IIIt < Cpy-^ (log yTYT^'^^^m- 



p/2-\-al3p 
For the second term, let $r_{s-t,k} = E(X_{s,k}X_{t,k})$ for $k = 1, 2$; we have



(56) 



b=l 



6=1 '-s,teB* 



i<s<t<r 



l<s<t<T 



By (39) and (40), we know Xl^g^ |n,fc| < cxd for A; = 1,2, and hence |as,t,fc| < 
CT. It follows that the expectations of the two terms in (56) are all less 
than CT^, and 



(57) 



II T <CpU 



T,m, 



r(log yr)^ 



+ CpU 



T, [m^J, 



r(log yrY' 



Combining (53), (54), (55) and (57), we have shown that U{T,m,ZT) is 
bounded from above by 



U{T, Kj,ZT-2yT) 



(58) 



r, 



r,m. 



r(logyT)2 



r(logyT)2. 

+ y/(logyT)^TP/2+i,„p/2-i-"/3p]. 

Since supj^^ ||Eo-Rt|Ip/2 ^ CpT by (41), by applying (58) recursively to deal 
with the last term on the first line of (58) for q times such that {ut/T)' 
0[y. 



~2ip . 



-{M + l)n 



T 



we have 



(59) 



C/(r, m, zt) < Cp,M,sAU{T, lm^\ , zt - 2yT) + y-^/'Tm^/^-i-^/'P/^ 



Using the preceding arguments similarly, we can show that when 1 < m < 3 

c/[r,m,ZT/(2iT)]<C'M,p,5^>^/'(iogr)r + z^^(iogzT)^+'rp/2+i + ^-^'^]. 

The details of the derivation are omitted. Applying (59) recursively for at 
most Jt — 1 times, we have the first bound for U{T,m, zt), 

U{T,m,ZT) 

< Ci:^M,sAU[T, 3, ZT/{2jT)] + z-^^\log ZT)T{mP/'-^-^M2 + i) 

+ z/(logZT)^+'TP/2+l(,nP/2-l-"/3P + 1) + ^-A-^} 



(60) 



<C}:sA^ogZT 



T + z-*'rp/2+i)( 



m 



p/2-l-o/3p/2 



+ !)• 
Now plugging (60) back into (58) for the last two terms on the first line and
using the condition $T^{1+\delta/2} = o(y_T)$, we have

U{T, m, zt) < UiT, [m^\ , zt - 2yT) 

(61) 

+ C,,,,;3[y-^/^r(m^/2-i-/^^'/2 + l)]. 

Again by applying (61) for at most jx — 1 times, we obtain the second bound 
for U{T,m, Zt)'- 

U{T,m, zt) < Cp,5,^z-^/'(riogr)(mP/2-i-°/^P/2 + l). 

The proof of the claim (52) is complete. □ 

8. Conclusion. In this paper we use Toeplitz's connection of eigenval- 
ues of matrices and Fourier transforms of their entries, and obtain optimal 
bounds for tapered covariance matrix estimates by applying asymptotic re- 
sults of spectral density estimates. Many problems are still unsolved; for 
example, can we improve the convergence rate of the thresholded estimate 
in Theorem 5? What is the asymptotic distribution of the maximum eigen- 
values of the estimated covariance matrices? We hope that the approach 
and results developed in this paper can be useful for other high-dimensional 
covariance matrix estimation problems in time series. Such problems are rel- 
atively less studied compared to the well-known theory of random matrices 
which requires i.i.d. entries or multiple i.i.d. copies. 

Acknowledgments. We are grateful to an Associate Editor and the ref- 
erees for their many helpful comments. 

SUPPLEMENTARY MATERIAL 

Additional technical proofs (DOI: 10.1214/11-AOS967SUPP; .pdf). We 
give the proofs of Remark 5 and Lemma 9, as well as a few remarks on 
Lemma 9. 

REFERENCES 

Adenstedt, R. K. (1974). On large-sample estimation for the mean of a stationary 

random sequence. Ann. Statist. 2 1095-1107. MR0368354 
An, H. Z., Chen, Z. G. and Hannan, E. J. (1983). The maximum of the periodogram. 

J. Multivariate Anal. 13 383-400. MR0716931 
Andrews, D. W. K. (1984). Nonstrong mixing autoregressive processes. J. Appl. Probab. 

21 930-934. MR0766830 
Bai, Z. and Silverstein, J. W. (2010). Spectral Analysis of Large Dimensional Random 

Matrices, 2nd ed. Springer, New York. MR2567175 
Bai, Z. D. and Yin, Y. Q. (1993). Limit of the smallest eigenvalue of a large-dimensional 

sample covariance matrix. Ann. Probab. 21 1275-1294. MR1235416 






Bentkus, R. and Rudzkis, R. (1976). Large deviations for estimates of the spectrum of 

a stationary Gaussian sequence. Litovsk. Mat. Sb. 16 63-77, 253. MR0436510 
Bercu, B., Gamboa, F. and Rouault, A. (1997). Large deviations for quadratic forms 

of stationary Gaussian processes. Stochastic Process. Appl. 71 75-90. MR1480640 
Bercu, B., Gamboa, F. and Lavielle, M. (2000). Sharp large deviations for Gaussian 

quadratic forms with applications. ESAIM Probab. Stat. 4 1-24 (electronic). MR1749403
Bickel, P. J. and Levina, E. (2008a). Covariance regularization by thresholding. Ann.

Statist. 36 2577-2604. MR2485008
Bickel, P. J. and Levina, E. (2008b). Regularized estimation of large covariance
matrices. Ann. Statist. 36 199-227. MR2387969
Bryc, W. and Dembo, A. (1997). Large deviations for quadratic functionals of Gaussian 

processes. J. Theoret. Probab. 10 307-332. Dedicated to Murray Rosenblatt. MR1455147 
Bryc, W., Dembo, A. and Jiang, T. (2006). Spectral measure of large random Hankel, 

Markov and Toeplitz matrices. Ann. Probab. 34 1-38. MR2206341 
Bühlmann, P. and Künsch, H. R. (1999). Block length selection in the bootstrap for

time series. Comput. Statist. Data Anal. 31 295-310.
Burkholder, D. L. (1988). Sharp inequalities for martingales and stochastic integrals.

Colloque Paul Lévy sur les Processus Stochastiques (Palaiseau, 1987). Astérisque
157-158 75-94. MR0976214
Cai, T. T., Zhang, C.-H. and Zhou, H. H. (2010). Optimal rates of convergence for 

covariance matrix estimation. Ann. Statist. 38 2118-2144. MR2676885 
Dembo, A. and Zeitouni, O. (1998). Large Deviations Techniques and Applications, 2nd 

ed. Applications of Mathematics (New York) 38. Springer, New York. MR1619036 
Djellout, H., Guillin, A. and Wu, L. (2006). Moderate deviations of empirical pe- 

riodogram and non-linear functionals of moving average processes. Ann. Inst. Henri 

Poincaré Probab. Stat. 42 393-416. MR2242954
El Karoui, N. (2005). Recent results about the largest eigenvalue of random covariance

matrices and statistical application. Acta Phys. Polon. B 36 2681-2697. MR2188088
El Karoui, N. (2008). Operator norm consistent estimation of large-dimensional sparse

covariance matrices. Ann. Statist. 36 2717-2756. MR2485011
Freedman, D. A. (1975). On tail probabilities for martingales. Ann. Probab. 3 100-118.

MR0380971 

Furrer, R. and Bengtsson, T. (2007). Estimation of high-dimensional prior and pos- 
terior covariance matrices in Kalman filter variants. J. Multivariate Anal. 98 227-255. 
MR2301751 

Gamboa, F., Rouault, A. and Zani, M. (1999). A functional large deviations principle 
for quadratic forms of Gaussian stationary processes. Statist. Probab. Lett. 43 299-308. 
MR1708097 

Geman, S. (1980). A limit theorem for the norm of random matrices. Ann. Probab. 8 
252-261. MR0566592 

Grenander, U. and Szego, G. (1958). Toeplitz Forms and Their Applications. Univ. 
California Press, Berkeley. MR0094840 

Haeusler, E. (1984). An exact rate of convergence in the functional central limit the- 
orem for special martingale difference arrays. Z. Wahrsch. Verw. Gebiete 65 523-534. 
MR0736144 

Horn, R. A. and Johnson, C. R. (1990). Matrix Analysis. Cambridge Univ. Press, 

Cambridge. Corrected reprint of the 1985 original. MR1084815 
Johansson, K. (2000). Shape fluctuations and random matrices. Comm. Math. Phys. 209 

437-476. MR1737991



Johnstone, I. M. (2001). On the distribution of the largest eigenvalue in principal com- 
ponents analysis. Ann. Statist. 29 295-327. MR1863961 

Kakizawa, Y. (2007). Moderate deviations for quadratic forms in Gaussian stationary 
processes. J. Multivariate Anal. 98 992-1017. MR2325456 

Kolmogoroff, A. (1941). Interpolation und Extrapolation von stationären zufälligen
Folgen. Bull. Acad. Sci. URSS Ser. Math. [Izvestia Akad. Nauk SSSR] 5 3-14. 
MR0004416 

Lin, Z. and Liu, W. (2009). On maxima of periodograms of stationary processes. Ann. 

Statist. 37 2676-2695. MR2541443 
Liu, W. and Shao, Q.-M. (2010). Cramér-type moderate deviation for the maximum of

the periodogram with application to simultaneous tests in gene expression time series. 

Ann. Statist. 38 1913-1935. MR2662363 
Liu, W. and Wu, W. B. (2010). Asymptotics of spectral density estimates. Econometric 

Theory 26 1218-1245. MR2660298 
Marčenko, V. A. and Pastur, L. A. (1967). Distribution of eigenvalues in certain sets

of random matrices. Mat. Sb. (N.S.) 72 507-536. MR0208649
McMurry, T. L. and Politis, D. N. (2010). Banded and tapered estimates for
autocovariance matrices and the linear process bootstrap. J. Time Series Anal. 31 471-482.

MR2732601 

Nagaev, S. V. (1979). Large deviations of sums of independent random variables. Ann. 

Probab. 7 745-789. MR0542129 
Peligrad, M. and Wu, W. B. (2010). Central limit theorem for Fourier transforms of 

stationary processes. Ann. Probab. 38 2009-2022. MR2722793 
Politis, D. N. (2003). Adaptive bandwidth choice. J. Nonparametr. Stat. 15 517-533. 

MR2017485 

Politis, D. N., Romano, J. P. and Wolf, M. (1999). Subsampling. Springer, New York.
MR1707286 

Rio, E. (2009). Moment inequalities for sums of dependent random variables under pro- 
jective conditions. J. Theoret. Probab. 22 146-163. MR2472010 

Rudzkis, R. (1978). Large deviations for estimates of the spectrum of a stationary
sequence. Litovsk. Mat. Sb. 18 81-98, 217. MR0519099

Saulis, L. and Statulevičius, V. A. (1991). Limit Theorems for Large Deviations.
Mathematics and Its Applications (Soviet Series) 73. Kluwer, Dordrecht. Translated and
revised from the 1989 Russian original. MR1171883

Shao, X. and Wu, W. B. (2007). Asymptotic spectral theory for nonlinear time series. 
Ann. Statist. 35 1773-1801. MR2351105 

Solo, V. (2010). On random matrix theory for stationary processes. In IEEE International
Conference on Acoustics Speech and Signal Processing (ICASSP) 3758-3761. IEEE,
Piscataway, NJ.

Toeplitz, O. (1911). Zur Theorie der quadratischen und bilinearen Formen von
unendlichvielen Veränderlichen. Math. Ann. 70 351-376. MR1511625
Tong, H. (1990). Nonlinear Time Series. Oxford Statistical Science Series 6. Oxford Univ.

Press, New York. MR1079320
Tracy, C. A. and Widom, H. (1994). Level-spacing distributions and the Airy kernel. 

Comm. Math. Phys. 159 151-174. MR1257246 
Turkman, K. F. and Walker, A. M. (1984). On the asymptotic distributions of maxima 

of trigonometric polynomials with random coefficients. Adv. in Appl. Probab. 16 819- 

842. MR0766781 

Turkman, K. F. and Walker, A. M. (1990). A stability result for the periodogram. 
Ann. Probab. 18 1765-1783. MR1071824 






Wiener, N. (1949). Extrapolation, Interpolation, and Smoothing of Stationary Time Se- 
ries. With Engineering Applications. MIT Press, Cambridge, MA. MR0031213 

Woodroofe, M. B. and Van Ness, J. W. (1967). The maximum deviation of sample
spectral densities. Ann. Math. Statist. 38 1558-1569. MR0216717 

Wu, W. B. (2005). Nonlinear system theory: Another look at dependence. Proc. Natl. 
Acad. Sci. USA 102 14150-14154 (electronic). MR2172215 

Wu, W. B. and Pourahmadi, M. (2009). Banding sample autocovariance matrices of 
stationary processes. Statist. Sinica 19 1755-1768. MR2589209 

Xiao, H. and Wu, W. B. (2011). Asymptotic inference of autocovariances of stationary 
processes. Available at arXiv:1105.3423. 

Xiao, H. and Wu, W. B. (2012). Supplement to "Covariance matrix estimation for sta- 
tionary time series." DOI:10.1214/11-AOS967SUPP. 

Yin, Y. Q., Bai, Z. D. and Krishnaiah, P. R. (1988). On the limit of the largest 
eigenvalue of the large-dimensional sample covariance matrix. Probab. Theory Related 
Fields 78 509-521. MR0950344 

Zani, M. (2002). Large deviations for quadratic forms of locally stationary processes. 
J. Multivariate Anal. 81 205-228. MR1906377 

Zygmund, A. (2002). Trigonometric Series. Vols I, II, 3rd ed. Cambridge Univ. Press,
Cambridge. MR1963498 

Department of Statistics 
University of Chicago 
5734 S. University Ave 
Chicago, Illinois 60637 
USA 

E-MAIL: xiao@galton.uchicago.edu
wbwu@galton.uchicago.edu



